Hackathon Updates, Guests from Oso, Confident AI, and Smithery, along with AI News
Today we check in on the Mastra Templates Hackathon; a few of the judges stop by (Smithery, Confident AI)! We chat with Oso about authorization and, as always, we discuss recent AI news.
Episode Transcript
Ah, hey. Hello everyone and welcome to AI Agents Hour. I'm Shane. I'm with my co-host
Abhi. We're both co-founders of Mastra. Today we have an actually very jam-packed show. I say that a lot, but we do have a jam-packed show today. We'll be talking about the Mastra Hackathon
that's going on right now. We'll be talking with someone from Oso. We'll be talking with Confident AI. We'll be
talking with Smithery. And we're going to be doing AI news like we do every week. What's up, dude?
What's up, man? It's been great. It's going to be a jam-packed episode. I remember last week was jam-packed, so
this will be great. Yeah. How was your weekend? It was good. Went to a Morgan Wallen
concert on Friday, hence the mustache. I haven't shaved it off yet. That's the next thing to do.
Dude, it's looking good. It's getting big, dude. It is. It is looking good. Like my dad in the '90s.
Did you get the cowboy hat? I really wanted you to wear a cowboy hat. No, dude. It's kind of crazy. I really wanted to, but no. And if you are watching this live: you know, we do these every week. We're live on YouTube, X, and LinkedIn, and we're also on Spotify and Apple Podcasts now. So maybe you're listening to this or watching this after the fact. We do share our screen a lot, so I always recommend the YouTube or Spotify version so you can see the video, but we are audio as well, so that's pretty cool. Yeah, you're going to love hearing our
voices without any context. So, it'd be great. Yeah. But if you are watching this, you should go give us a review if you've watched a few shows before. We would appreciate that. That helps us a ton. Yeah. I didn't do too much this weekend. I was pretty much just
kind of hanging out. Spent a little time with family, did a little work, you know. How's the weather there in Sioux Falls? It's been pretty nice, pretty mild actually. It was very hot for a while and now it's kind of calmed down. A lot of rain and storms the last couple weeks, but overall now it's been pretty good. A little rainy once in a while, but the weather is
pretty nice. Similar to San Francisco. Last week it was a little rainy, or like foggy. People told me SF summer was the thing I had to wait for, and I'm still waiting. Maybe the fall will be better. I always thought the appeal of SF was that it was basically the same year-round. Wasn't that supposed to be the appeal?
I mean, you haven't sat through a Sioux Falls, South Dakota winter, so... No, that's true. I think I'll take the SF rain over the Sioux Falls winter for sure. Yeah. Well, so, we do have Shria from the team who's going to be coming on here in a few minutes, but before she jumps on, anything else you wanted to talk about before we talk about the Mastra hackathon? Yeah, actually, there's one small thing. The current YC batch is going on, and I just got a message from some people in the batch that it's the time when people go and meet PG. That's where they're at. So I had a little bit of nostalgia thinking about that, because we met him too. And I guess the certain companies in the batch going to meet him are all nervous, and I just told them not to be nervous, because it's not that crazy. But I remember we were nervous. Yeah.
Well, you know, we stood outside and we're like, is this even the right house? We don't even know. He was running behind because he stacks all these meetings, and we're like, I don't even know if this is the right house. Yeah. And we're ringing the doorbell, knocking, no one's answering, dude. So, yeah. Someday we'll tell that whole story, but going to meet PG was a pretty cool experience. Yeah. So best of luck to everyone who's meeting him or has already met him.
Yeah, definitely good luck. Don't be too nervous. It's pretty chill. Just be yourself. Yeah. Not many people get to meet him, right? So, yeah, that's the thing to focus on. But
anyway, let's get on with the show. Yeah, I mean, I see you got your copies. You know, if you want a copy of this before we get started, you can get a digital copy by just going to mastra.ai/book. You can get a copy of The Principles of Building AI Agents, written by Sam, our other co-founder. It's been called, you know, the most popular book in SF. Yeah, it's going to be a New York Times bestseller at this rate.
Dude, it's been wild, the number of people. Every day I see different social media posts of someone, literally just their hand holding the book. Yeah. I don't know why that's a thing. If you're listening and you have the book, you apparently just have to take a picture with your hand and post it. I think that's like a rite of passage now. Yeah. It'll get views and we'll find you. Yeah, we will find you and we will share it, because that is literally my feed: just people's hands holding the book. Yeah, same. I'm just retweeting constantly, you know. But yeah, let's bring on Shria. Let's
talk about the Mastra hackathon and let's talk a little bit about Mastra templates. Shria, what's up? Hey, what's up? What's up? How's it going, Shane? How's it going, Abhi?
Good. How are you? I'm good. Is Abhi's mustache last week's news? Are we past that cycle? Yeah, we just talked about it. You missed it. Oh man, we're just missing it. So sad. So, Shria, we've been doing this hackathon. First of all, most people have probably seen multiple shows, and maybe they've seen you, but maybe do a quick introduction. Yeah, sure. I'm Shria. I help with
content and, I guess, general marketing here at Mastra. I wrote a little piece for Every about my role, so you can probably just search for Mastra on every.to and learn a little bit more about what I do. I think I was here last with Daniel, and we had a few interesting guests, including Rahul from Julius, who I believe was just on the TBPN show. So, I don't know, it's really cool.
Like, this is the live stream where it all starts, you know. We get them here first and then they make it big. Yeah, we're kingmakers over here, you know. Aspirational, someday. Yeah, someday. But no, I helped with the first hackathon. I think we had a handful of learnings, and I'm particularly excited about the second hackathon. It's for the Mastra templates initiative, and I think what's interesting here is that you have a chance to literally define what people's getting-started experience is with Mastra. Like, when else do you do a hackathon and actually have the potential for live, real-world impact? This isn't just a demo project that, you know... that is always cool, to see your creativity and inspire people to build for their applications, but this time it's a true community effort. We're open source, we're pulling on our community, and here's your chance to really shape the future for, I don't know, the next hopefully millions of developers who make the jump to AI engineering and use Mastra. Yeah, absolutely. I think this time the hackathon is just going to be even better than last time, because of those reasons, right? It's the idea that we want to
share these templates with others. A lot of times, I think, people build hackathon projects for two purposes. The first one is you're already building something, and so you might as well just enter it into a hackathon; maybe you win a prize, right? I think that's pretty common: you've already been working on a project. Or it's a project where you just want to learn, and you're probably going to throw it away. Maybe you'll use it sometime in the future, but it's kind of a throwaway project. The nice thing about this is that even if you are working on something, you might be able to pull out pieces of it and turn it into a Mastra template. So it could be tangential to stuff you're already
doing. But also, it's not necessarily going to be thrown away, because you can share your template with others on social media. We're going to be listing a bunch of them on the Mastra website. So we're going to take kind of the best ones and make sure that they're listed so others can find them; they're going to be discoverable. So I think it's nice that people can now install your template and use it. And it's kind of a way to give back: if you've been looking for a reason to get into open source and contribute to open source, this is a good way to do that. Yeah, I agree. And the one thing I would add is that, for winning this hackathon, yes, of course we want to see your creativity, and if you're already building something for yourself and you just want to refactor it into a template, that's great. I think this goes back to something
you said, Shane, which is that sometimes the best submissions are not that complicated or crazy; it's just building something that is actually useful, the obvious thing that isn't really there. You can literally think of your specific role. In my case, I do content. Okay, cool, why don't I create a template for creating branded Midjourney images or something, ones that already fit within the company's brand guidelines? That's a very useful template that other companies can then use to make sure things are on brand and that we can ship things efficiently; maybe you don't have to be blocked by a designer. So those things that are obvious, and maybe seem too simple or something, can actually become the most useful ones, and then you have a great case for being in the community templates library, because those are things that people will want to use. Yeah. We had a question: John, coming in from LinkedIn, says, "Link the GitHub repo." So, I did post a link on
the video to mastra.build if you want to learn more about the whole hackathon, how you can be part of it, and how you can sign up. If you haven't already, go to mastra.build and you'll see all the information. Also, if you go to our GitHub repo (just search "Mastra GitHub" and you'll get there; I'll post it on the screen here in a second), you can see there's a templates folder in there where we have a bunch of example templates. You can also go to the mastra.ai/templates page to see the existing templates that we have, and hopefully you will see a bunch more here in the next couple weeks as we finish up this hackathon. Also, John has a message saying, "My team does high-level AI systems red-team operations; we could really use some good security examples for templates." So maybe there's something there. Agreed. I guess a few other details about the hackathon: we have around 200 people who have signed up right now. The beauty of this is that, you know, LLMs don't procrastinate, but you can, because you still have one full week to submit something. Submissions are due Friday morning at 8 a.m. Pacific. I think that's plenty of time. Like, Shane, Abhi, how long do you think it would take you guys to build something and submit something? A couple hours. So, yeah. Yeah.
Yeah, I totally think if you spent a few hours, especially if you have any Mastra experience at all, you could put together a template. The best templates aren't necessarily the most complicated, because I always look at templates as a learning resource for people. It could be just an example that someone could start with, or a building block if you're building a project. If it's too specific and too complex, that's almost a negative, because then it kind of locks people in; they've got to change a bunch more from the template. But if you think of a template as just a collection of agents or workflows that do some kind of interesting thing, then you can basically think of it as an interesting building block for someone to start with or learn from. I think those kinds make the best templates. I also think it's worth pointing out some of the judges and some of the funny
prizes. So, some of my favorite prizes are Shane's favorite, which is always a wild card. Whatever, however Shane is feeling that day will determine who the winner is for that prize. And then the funniest one, which of course is going to be judged by Abhi. He's a tough nut to crack, so if you make him laugh, I mean, you deserve a prize for that. So those are some of my favorite prizes, but we also have, you know, some
more, I guess, practical prizes. So like best use of MCP, best use of a tool provider, best RAG template. Really, we have a lot; we have like 13 categories.
So the odds are in your favor to potentially win something. Yeah, what's the best way to see all the categories? Okay, that is a great question. We have a slideshow. If you look at the hackathon slides that are linked from this page, which is just the URL mastra.build, and go to slide 10, you'll be able to see all of the categories and which companies are judging them. This slideshow really has all the prizes, all the categories. It has a submission link. Really, it should be
your go-to resource for all information. Cool. And we do have some great sponsors: Smithery and Recall are both helping us with this. We had Recall on last week, which was really cool. Smithery is coming on later today. So,
definitely interested to see more about what they're doing, but thanks to them for helping make this possible, and to all of our great judges as well. We've got a bunch of judges, and yeah, it's a pretty big operation putting all this together. So, Shria, I've got to give you kudos for helping wrangle everything. It's all good. I think it's worth pointing out that there's a Nintendo Switch 2 up for grabs, sponsored by Smithery. And Shane, if you don't mind going to the prizes slide, it's worth saying that, hey, even if you don't win an Amazon gift card for these categories... I believe it's earlier in the slideshow, the prize pool. Oh yeah, that one. Great.
Yeah. We also have prizes that are just raffle prizes. So, even if you submit something and we don't pick it as a winner, you could still win a Raspberry Pi, a mechanical keyboard, and a few other goodies. And Mossum, with a YouTube comment: "Procrastinated for too long, started working today." That's cool. You've got plenty of time. You've got all week.
Uh, yeah. I mean, I do think from the last hackathon too, these raffle prizes are really useful, because even if you don't think your template is good enough to win, you might as well submit something, because you've got a shot. Any realistic submission gets entered in. It can't be nothing, of course, but if you put some effort into it, you're in the prize pool; you've got a chance. And if you live in, I think, the US, Canada, Japan, England, a handful of eligible countries, you're getting a copy of the book by default. We always hold it up. So yeah, that will be at your desk soon. All you've got to do is submit a template.
All right. Yeah, I'm excited. Are there any templates that either of you would like to see? We have quite a few people watching this, and maybe they're considering it but aren't sure of an idea; what are some templates they could maybe start working on? I can go first while you all think of some, since I put you on the spot. I think one of the things people sometimes struggle with is authentication. So I'm excited about people maybe submitting something around what we sometimes call dynamic agents: how do I have each of my users get an individual instance of an agent that does something? It's all possible with Mastra, but it is kind of complicated to set up and think through. You know, Abhi might have a different memory and instance of the agent than I have, one that has access to different things. You can totally set that up with Mastra, and a lot of people have, but we don't have a lot of templates for how to do that, or a lot of great getting-started guides. So I'm hopefully looking forward to one or two good examples that would showcase at least a technique for setting that up, because I think that would help a lot of people. Yeah, that's a really good one. And that's one we get questions about.
Honestly, for me it's all the templates for things we get questions about. So if someone does deep research, or RAG pipelines and stuff like that. It's one thing to read docs, but it's another to just get code for free, and if you can provide that value, then that's dope. Definitely. I think for a lot of companies, as they grow, it makes sense to write case studies. So I would love seeing a use of Agent Network where, even before you go into your interview with your customer, maybe it does some research on them and helps you generate some specific questions to ask. Then once you do your interview, you upload the link or a transcript, and before it even drafts something, it goes through it with you: is this the right outline? Are these the right points to talk about? It helps you draft, maybe with an editor agent. So I don't know. I just really love Agent Network. I just want to build with that. Yeah.
I think another one, and this is more personal for me, but I've done a lot with image generation and audio, you know, text to speech. So anything with voice, anything with image gen, those I think are just cool examples. Everyone who's just getting started with Mastra thinks of it like a chat agent, right? But it can do a lot more than that. So I'm very interested to see more of the multimodal types of examples, specifically with workflows and all that. You know what we previously released, the idea of this storybook generator. I think there's someone I've talked to who may be working on taking that and expanding it into a template. But other things around that I think would be really cool as well. Selfishly, I just want cool agents that I can show my kids: here's a cool story built by an agent. Then I don't have to be as creative. I can offshore some of the creativity to some
kind of template, and then I'm the cool dad, you know, because I have all these cool stories. I don't know. I mean, I think the sky's the limit. Like, for personal productivity, personal life, it's like, why don't I have a nutritionist and a personal trainer agent that are working together? And why don't you even add a stress management thing in there? Because all of those things relate to each other. So, I don't know, Shane, I feel like there are endless use cases and possibilities here. Yeah. Well, Shria, we do appreciate you coming on. I think Abhi and I are probably going to just poke around with some code. So, if you are watching this, you should be watching the video version. We're going to talk through how simple it is. Maybe we'll look at a template or two in our repo. So, if you haven't looked at what a template is, you can see it's actually not too bad. But any parting words? I'll just say, you know, the hackathon is just the beginning. I won't say too much, but there are so many more opportunities to get involved with Mastra, and we have some exciting stuff down the line. Maybe keep an eye on our Twitter later this week. Yeah, we have a big launch or big announcement coming up here pretty soon. So excited about that. Can't share more quite yet, but that's going to be, you heard it here first, so keep an eye out. That's all. Yeah, it's coming. All right, Shria, thanks for stopping by and good luck with the rest of the hackathon. Thank you. Take care. Take care, guys.
All right. I'm excited for templates, man. That's what you said. Yeah, I'm just excited that you can quickly get started, and that not everyone getting started with Mastra starts with the same weather-agent experience. Nothing wrong with it, but oftentimes if you're used to using frameworks, you already have an idea of what you want to build. So if you can get a little closer to that starting spot, I think it goes a long way. Yeah, we do have a question in the chat coming from X. It's Vivek. He says, "I had a question about Agent Network. How is it
different from a normal agent with a tool-call limit that has sub-agents as tools?" In the grand scheme of things, they're not different at all, and Agent Network probably won't exist in the future in its current form, but it was a way for us to experiment with how we're going to do multi-agent systems. We use workflows under the hood, not loops from other libraries. So it was a good experiment. We know how we want to do it, and the next iteration of that will probably just be an agent; it's not going to be anything new. But yeah, in the grand scheme of things there's no difference. Right now there is a difference, because of how we actually built it: Agent Network is better than just doing an agent by itself with different tools. It is experimental though, so maybe it's not for your use case. My hot take is that it is better for most use cases, because you don't have to think about all the details of wiring it up. But if you want more control, you can of course accomplish the same things and have a little more flexibility if you did it yourself: you control when the agent calls the tool, you control the system prompt, you control the tool descriptions. So if you want that level of control, yes, an agent with sub-agents as tool calls is the way to get it. If you want to just wire up a network really quickly, and you still have control, of course, but not quite down to the bare metal of the other version, Agent Network is a good way to get started. Yeah. Another technical detail: if you're streaming an agent within a tool call, you're going to lose the stream data; that's just the nature of it. We're working on fixing that and making it better in general. But yeah, there's a reason why we built it. You'll see soon enough, I guess.
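For readers following along, here is a rough sketch of the "agent with sub-agents as tools" pattern being compared, assuming Mastra's current `Agent` and `createTool` APIs (agent names, prompts, and schemas are hypothetical):

```typescript
import { Agent } from "@mastra/core/agent";
import { createTool } from "@mastra/core/tools";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// A sub-agent (name and instructions are hypothetical)
const researchAgent = new Agent({
  name: "research-agent",
  instructions: "Research a topic and return a short summary.",
  model: openai("gpt-4o"),
});

// Wrap the sub-agent as a tool so a parent agent can delegate to it
const researchTool = createTool({
  id: "research",
  description: "Delegate research questions to the research agent.",
  inputSchema: z.object({ query: z.string() }),
  outputSchema: z.object({ summary: z.string() }),
  execute: async ({ context }) => {
    const result = await researchAgent.generate(context.query);
    return { summary: result.text };
  },
});

// The parent agent: you control its prompt, the tool descriptions,
// and when the tool gets called, at the cost of wiring it up yourself
export const routerAgent = new Agent({
  name: "router-agent",
  instructions: "Answer directly, or delegate research to your tool.",
  model: openai("gpt-4o"),
  tools: { researchTool },
});
```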
And Marcelo also posted a message on X: Mastra with React Native, is it possible? Anything is possible, Marcelo. Anything is possible. Using Mastra within React Native may be tough. There are a lot of server libraries, and I don't necessarily know if they're compatible or can be browserified or whatever, or how that works. But if you're using Mastra as a server, you can do whatever you want, because that's just typical React Native development. We are looking into how you can use it in these browser-like environments, but not right now. Yeah. So I think it's very easy if you just want to use the Mastra client and you're hosting your agents somewhere else, you know, with us on Mastra Cloud or some other server provider, and then React Native is just the client into those agents. You can definitely make that happen pretty easily. Yeah.
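A minimal sketch of that client-only setup, assuming Mastra's `@mastra/client-js` package (the URL and agent ID are placeholders):

```typescript
import { MastraClient } from "@mastra/client-js";

// The React Native app only talks to a hosted Mastra server over HTTP
const client = new MastraClient({
  baseUrl: "https://my-mastra-server.example.com", // placeholder URL
});

// Call a deployed agent by ID from a component or event handler
export async function askWeatherAgent(city: string): Promise<string> {
  const agent = client.getAgent("weatherAgent"); // placeholder agent ID
  const response = await agent.generate({
    messages: [{ role: "user", content: `What's the weather in ${city}?` }],
  });
  return response.text;
}
```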
And we've seen people, I'm pretty sure I've seen a React Native example, but also Electron and other things like Expo. Yeah. Cool. Should we look at a template or two while we wait for our first guest? All right. So, let's go browse. We're going to come at it with fresh eyes: you're going to pick a template, and then we will look at the code and review it together. How's that sound? Sounds good. All right.
Here we are. We're in the Mastra templates gallery. There are only six now. There are more to come, especially after this hackathon; we will obviously be adding community-contributed templates to this page. Let's look at the browser agent. The browser agent, okay. So, the browser agent uses Browserbase's Stagehand with Mastra. That way you can basically wire it up and have it do browser actions for you. So, let's take a look at the actual repo. Okay, so the first thing you'll notice about a template is, you know, it's listed as a public template, which is great, but it's just a pretty basic Mastra project. It has a source folder, a README, an environment file, and of course, as you'd expect, a package.json and a tsconfig. Any other comments on first impressions? Pretty simple. It's pretty simple.
So, if we go into the source folder, if you've used Mastra at all, you won't be surprised that there's a mastra folder here. There's also a lib folder which has some helper functions. Inside the mastra folder, we have an agents folder, we have a tools folder, and we have this index.ts, which is always going to basically export the Mastra instance. When you're actually building a Mastra project, you pretty much have flexibility to do whatever you want; it doesn't have to be in this mastra folder. But with templates we enforce a little bit more structure. Meaning, if you want a template that's listed on the Mastra website, you need to follow some basic conventions, and these conventions are listed in our contribution docs on the docs website. But it's pretty much the basic conventions that you would get if you started a Mastra project. You don't have to follow those in your own projects, of course. But if you want to make a template, we do ask that you follow this layout: an agents folder, a tools folder, and an index file that exports the Mastra instance out of the main source mastra folder. Yeah, for templates especially, we want the user to always be familiar within any template, because, you know, I know y'all be naming things weird and stuff. I do it all the time. You don't want your user to be like, what the is this? What folder is stuff in? I'm not used to this code structure. So you've got to make sure it's the same.
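For reference, the conventional layout looks roughly like this (folder names from the walkthrough; the template name is hypothetical):

```
my-template/                 # template name is hypothetical
├── src/
│   ├── lib/                 # helper functions
│   └── mastra/
│       ├── agents/          # one file per agent
│       ├── tools/           # one file per tool
│       └── index.ts         # exports the Mastra instance
├── .env.example
├── package.json
├── README.md
└── tsconfig.json
```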
Okay. And if we go look in our agents folder, this is pretty simple. There's one agent. So you can look at our web agent here. Want to zoom in a click? Yep. So it's only 35 lines of code in here, so really not much. I think the most important thing is we're importing a couple of tools and we're giving those tools to an agent. You can see here we have a system prompt which is pretty simple. You know, if you were actually productionizing this, your system prompt would be significantly larger, right? But as a template, it's a good starting spot. It's using OpenAI's GPT-4o.
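A minimal sketch of what an agent like the one on screen can look like; the tool names and import paths are assumed from the walkthrough, not copied from the template:

```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
// Tool names assumed from the walkthrough below
import { extractTool, navigateTool, actTool, observeTool } from "../tools";

export const webAgent = new Agent({
  name: "web-agent",
  // Kept short for a template; a production system prompt would be far larger
  instructions:
    "You browse the web on the user's behalf. Navigate to pages, observe " +
    "and act on their elements, and extract the data the user asks for.",
  model: openai("gpt-4o"),
  tools: { extractTool, navigateTool, actTool, observeTool },
});
```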
tools? Yeah, let's look at the tools. All right. So, we go into the tools folder. As you'd expect, one, you know,
file for each tool. Which one should we look at first? Let's do the extract tool.
All right, walk me through it. Cool. Let's do some clicks though. Cool. So, in this example, this template is creating its own tools, which is great. This shows you that you can do whatever you want with Mastra, with JavaScript. And in this case, we're using Stagehand. So here we have a createTool call. This turns your code into tool-compatible or workflow-compatible code in Mastra. You give it an ID and a description; descriptions are super important, especially for agents. This one's pretty simple. Then there's an input schema, which is pretty self-explanatory, and an output schema. What does the tool return in this case? You could have so much unstructured data coming out of this tool that you just use `any`, and that's just the way it is. And then within it, we just call, you know, perform web extraction, which hits up Stagehand and extracts the data, which, as you'll see, could be a bunch of things. Yeah. And it's just a function, right? So if you look, we're loading Stagehand in from the session, and we're calling Stagehand's extract method with the instruction, and it extracts data from that web page. So it uses the Stagehand library under the hood to basically just extract data from a site.
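A hedged sketch of that extract tool; the session helper is hypothetical, and the exact Stagehand call shape is assumed from the discussion:

```typescript
import { createTool } from "@mastra/core/tools";
import { z } from "zod";
// Hypothetical helper that loads the shared Stagehand browser session
import { getStagehandSession } from "../../lib/stagehand";

export const extractTool = createTool({
  id: "web-extract",
  // Descriptions matter a lot: the agent uses them to pick the right tool
  description: "Extract structured data from the current web page.",
  inputSchema: z.object({
    instruction: z.string().describe("What data to extract from the page"),
  }),
  // Page content is unpredictable, so the output schema falls back to any
  outputSchema: z.any(),
  execute: async ({ context }) => {
    const stagehand = await getStagehandSession();
    // Stagehand turns the natural-language instruction into extracted data
    return await stagehand.page.extract(context.instruction);
  },
});
```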
Yeah. And all the other tools are very similarly formatted, right? One is to navigate between pages. So, you know, createTool; if we look down here, it's probably just going to call page.goto. So it's just navigating, right? This is a navigation tool. We have an act tool, which I'm assuming is, by definition, acting on the page: clicking a link, clicking a button, filling out a form, something like that. So, performing some kind of action. And then the last tool is observe, which I believe just looks at the page and observes any elements that are on it. This is a great example of a template for many reasons. One, this template shows something that many people want to do, which is have an agent navigate the web. That's one thing. Two, it does it in a simple way, because now you probably understand it's not that difficult to have an agent navigate the web in different ways. And three, if you were going to build some type of browser-agent thing that's very specific to your use case, this is a great starting point, because you can edit the instructions, you can add more tools, or make it more specific, like your browser tool knows the website that you're on, because you're the one building that thing. That's why templates are super powerful.
So that's a template: pretty basic, pretty simple in that case. You know, we did have a question after we released this template: why don't we use the Stagehand MCP server? Mainly because we based this on an example we built well before the Stagehand MCP was finalized. We could use the MCP, but in some ways I kind of like that it doesn't, because it shows an example of how you can just write functions to have it do things. We have the SDK for Stagehand; we can just call methods in the SDK. It might be a little easier to use the MCP, and we might update it in the future to use the MCP, because I think it'd be a little less code. But if there's not an MCP, as you can see, you can just write your own functions for tools. And if, let's say, Stagehand were only in Python, then we would be using the MCP. But because we're in Node-land and they have a library, why not? Exactly. You don't have to use MCP, by the way. You can do whatever the you want. This is what I'm saying.
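For contrast, wiring an MCP server into Mastra looks roughly like this (a sketch assuming Mastra's `MCPClient` from `@mastra/mcp`; the MCP server package name is a placeholder):

```typescript
import { MCPClient } from "@mastra/mcp";

// Connect to an MCP server instead of hand-writing the tool functions
const mcp = new MCPClient({
  servers: {
    stagehand: {
      command: "npx",
      args: ["-y", "@browserbasehq/mcp-stagehand"], // package name is a placeholder
    },
  },
});

// The discovered tools can be passed straight into an Agent's `tools` option
const mcpTools = await mcp.getTools();
```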
All right. Uh, and with that, we're going to move on. We have our first guest here.
Awesome. So, we're going to bring on our first guest of the day. We have Vijay. VJ? I'll have to ask how to pronounce that. From Oso. So, hi, thanks for having me. Yeah, welcome to the show. Welcome. Can you introduce yourself and talk a little bit about Oso? I know Abhi and I are going to have lots of questions, and you probably want to show some stuff. Absolutely. I guess the first answer is you got my name right the first time: it's Vijay. And Oso is authorization as a service. Obviously, with MCP, agents, and new tools that can do things at a scale humans couldn't before, authorization is at the forefront of everyone's minds. Oso, at its core, is a tool for making it easy to build access control into any application, and
in the AI space we're exploring things anywhere from how to build secure, authorized RAG, to how to make the developer experience with Oso easier via things like the MCP. Yeah, auth has always been important, right? I'm not saying anything anyone doesn't know, but it seems like people are scratching their heads about it even more, or are more confused about how to think about auth when you have agents that might do something on your behalf. Do they authenticate as you? Do they authenticate as some kind of additional being that has some different level of permissions? How does the approval of that happen? I mean, there's so much in the auth space. There always has been, but it feels like there's even more confusion, and more need for tools to help people navigate auth in general with agents. Absolutely. And it's easy to come to simple conclusions, like:
you have an agent acting on behalf of a user, so it should just inherit the permissions that the user has. But then you look at things like, let's say you're unleashing an agent on your AWS instance. I'm going to give all of my engineers unfettered access to our prod environment, because they're engineers that I trust. But I wouldn't give an agent the same amount of access that I have, because I don't want it taking down prod or something. So it's not even a simple matter of inheriting the same permissions, and there are things around controlling what kinds of permissions you want agents to have. But another really interesting space is RAG. There's a new problem coming up that has kind of come up in the search space before. Companies like Glean, for example, are built around how to control access to AI-based search over large quantities of data, where the data comes from third-party systems that have their own permissioning models. So if I want to build an application that can have third-party data, from Google Drive or Notion or something else with its own access control model, and I also want to do RAG-based search to enable a chatbot or an agent to answer questions based on that, it's a tricky architectural problem to push the permissions down into that kind of semantic, embedding-retrieval-based search, in order to give the chatbot access to only the context it should have access to. And that's where Oso's local authorization API comes into play. So, I have a little bit of a demo that we can get into whenever we want. Yeah. Any questions before we dive right into the demo, Abhi? I've got a
thousand questions, but I guess I'll do the demo first, and then we'll go from there. And then we'll have a thousand and one questions, I'm sure. Perfect. So, local authorization is an API that Oso has had for quite a while. I have a kind of combo demo here. Can you see my screen? All right. Yeah. Can you give us one or two clicks of zoom? Yes.
So, that's better. Yeah. This is our DSL for defining an authorization model. This is some sort of HR document-organization application, for example. So you might have departments within an organization, and different departments own documents. The document has a creator and a department, and we have rules like: if you're the manager of a department, then you're a reader on the department's documents, and if you're a reader on a document, then you can read it. We don't need to go too far into this policy; it's just an example of a simple app that somebody might build. And when you have a large number of documents in an embeddings database... so, in this case, I'm actually using pgvector, and I'm using OpenAI's text-embedding-3-large model to generate embeddings for these. And I have data in here where documents belong to certain people and certain departments. For example, we have these two departments: the engineering department is ID 1, and the HR department is ID 2. And if I just spin up a little chatbot here: we made this little chatbot as a way to demonstrate how you can use Oso and local authorization to control which documents the chatbot is fed as context. So let's say I'm authenticating as user Jane, and Jane is a manager of the engineering department. Let's see something that she should be able to know about. Jerry is
also in engineering, and he's working on a project called The Great Escape. So let's ask the chatbot: what is Jerry working on? And that looks good. Then let's ask something else. Karen is 49 years old, but Karen is in the HR department. So if we ask how old Karen is, we don't have that information. Let's
say I switch to a different user. Um let's become George, who is the manager of the HR department. So, if I'm George and I ask, um what is Jerry working on?
George is not able to access the information. He is able to access the fact that Jerry and Karen are dating, but that's because that is in a document that Jerry himself created. And then if I ask um how old is Karen?
Karen's 49 years old. So I'll pull back the curtain on how that's implemented, and why it's kind of a difficult problem to implement in the first place. You have all these documents that you have various levels of access to, and the architectural problem becomes: how do I store and query the information about who is able to access which document? This is a small example where I have five documents in my database, but of course any real use case would have thousands or millions of documents. And it's not feasible to go ask an authorization system, hey, what are all the documents I have access to, get back half a million IDs, and plop that into a RAG search. It's also not feasible to store in your database a column on each document saying here are the thousands of users that have access to this document. So you need to do
something a bit more clever. I was able to piece this together with just the Polar policy itself, which is completely agnostic about where the various pieces of data live, plus Oso's local authorization feature, which we're using in this API call right here. I'm actually using the SQLAlchemy extension that Oso built for local authorization. This allows you to use local authorization with the exact same ergonomics as SQLAlchemy itself, which is a popular ORM for querying databases like Postgres with pgvector. So here I'm doing my L2 distance search over content embeddings, and then what Oso allows you to do is just add this authorized filter. You chain it in like all of SQLAlchemy's other fluent APIs, and you get back just the documents that the user is able to access.
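The demo is Python with SQLAlchemy, but the same pattern can be sketched in Node; this is hedged: it assumes Oso's `oso-cloud` client and its `listLocal` API, pgvector's `<->` L2 distance operator, and hypothetical table and column names:

```typescript
import { Oso } from "oso-cloud"; // Oso's Node client; API shape assumed
import { Pool } from "pg";

const oso = new Oso("https://cloud.osohq.com", process.env.OSO_API_KEY!);
const pool = new Pool(); // connects via the standard PG* env vars

// Return only the documents this user may read, ranked by L2 distance.
async function authorizedSearch(userId: string, queryEmbedding: number[]) {
  // listLocal resolves facts stored in Oso plus facts stored in our own
  // tables, and returns a SQL condition encoding the user's permissions
  const filter = await oso.listLocal(
    { type: "User", id: userId },
    "read",
    "Document",
    "id" // hypothetical column holding each document's ID
  );
  const { rows } = await pool.query(
    `SELECT id, content FROM documents
     WHERE ${filter}
     ORDER BY embedding <-> $1::vector
     LIMIT 5`,
    [JSON.stringify(queryEmbedding)] // pgvector accepts "[1,2,3]" literals
  );
  return rows; // feed these to the chatbot as context
}
```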
I also have this hooked up to a tool that we built called Oso Migrate, so you can actually dig into this and see why a given access was allowed. Let's say we're Jerry, and I want to find out why I was allowed to read document 5. We have all these local facts, and then we also have server facts, and that's the key to this. If I were to go over to my MCP and ask what has_role facts are stored in Oso (this is an experimental MCP that we're building, and it allows a tool like Cursor to query the Oso API directly), you can see that we had, not literally a million, but a lot of facts about documents and user assignments and roles that we were able to answer the query with. But we're not actually storing all of them in Oso itself; in Oso, we're only storing department-level access. If I debug this query, I can see that user 3 has the read permission from a has_role fact: the role was granted from the document belonging to the department, and the document belonging to the department was stored as a fact in Oso. But the has_relation fact is what was allowed to close the loop between the department and the document. That's the part where we get to cut down the amount of cardinality that we're actually communicating between the authorization service and the Postgres database, by storing this in the Postgres database itself. Any given user is going to have access to maybe five departments at most, but each department might have thousands and thousands of documents. Oso's local authorization allows you to piece together these different sources of data in a single breath, and it also figures out where to get each piece of data so that it can construct a really efficient query against the database that you're searching over, to power a search like this. That's pretty cool. We use Oso ourselves, and use the local queries as well. I do wonder, though: this is from the point of view of a user that's
doing something through their agent. So the agent is acting on behalf of them, right? It's inheriting the role of whoever is asking the question and then applying those rules. That seems like a solved problem already, though, right? Because this is not something new; this is essentially locking down tools. You could do it the same way through Oso as well, right? Like, I only have access to these tools if the user's policy matches whatever document I have, or whatever. I think the authorization problem is more about dynamic situations, where you have to create a policy on the fly, or you're doing stuff on behalf of someone you need to identify first: you identify the person, then become them, then you have access to things. Are you all going in that direction, where it's more agentic authorization? Yeah, absolutely. These kinds of temporary-access things are what Oso uniquely allows for, such as contextual information that works in conjunction with the data that's stored in Postgres and
stored in Oso. It's very common to do on-the-fly access grants for situations like impersonation. An old-school example of this: you're a customer support agent, and you need to impersonate a user in order to see what the UI looks like from their end and do actions on their behalf. As agents start to take on roles like customer support, you can imagine using similar contextual information. When you're granting temporary access to an agent, you would also want to lock down the time frame for which you give the agent the access, so that it minimizes the blast radius. So Oso has expiring-access functionality that you can use. If you were to insert this temporary information about when an agent was granted access, you can just configure the policy to say that any agents that are granted access have access that expires after a certain time. And there are two philosophies for sharing permissions this way: one is called sharing, the other is called delegation. We find that delegation tends to be the safer thing to give agents. If you delegate access to an agent, and then the person who delegated that access loses their own access, the agent should not still have its access. Those are the kinds of things you can piece together with a policy language like Polar, and then piece together with wherever the data needs to live. What about roles specific to agents? So it's not delegation or sharing; it's actual roles. Are people doing that?
Yeah, so one thing that Oso allows for is modeling different types of actors. It's not just that a user is the only type of actor that can have roles. I've seen various customers have a user actor meant for human access, and then another actor type entirely, with its own rules and logic, that represents a machine, for API calls, applications, or agentic access. That allows you to share however much logic you want between the human users and the agents, while also writing logic that's specific to agents. For example, an agent might have the same role as a human, but that role grants fewer permissions; or you might grant agents entirely different roles altogether that come with their own permissions. Dude, it gets even more crazy when the agent OAuths into something and gets permissions from a third-party system too, right? Are y'all thinking about
consolidating access like that, or is it like, that's going to be theirs, and your stuff is whatever your policies control? So, that's one of the most interesting use cases we've come across, and it relates to the demo I just gave. The example I showed is all first-party data, but a really common use case is doing the same kind of search over third-party data, and at that point it's not even as simple as OAuthing in to get the documents. Let's say an agent wants to get a Google Drive document and give that information to the user. That's not as simple as just hitting the Google Drive API and passing along an OAuth token on behalf of the user who's initiating the request, because a lot of the time you're not doing individual document requests. You're doing something like RAG, where all these documents had to have been processed into embeddings, and they all got mixed into this big soup of a database that people have mixed access over. We've talked to multiple companies who have exactly that use case, with third-party data in a RAG-based system, and without Oso I've seen combinations of either choosing not to build certain functionality, because it's just too hard to filter the permissions, or really inefficient ways of filtering, such as the thing I mentioned earlier, where you do a pre-flight OAuth check for every combination of user and document and then store those IDs inside the database rows, or you do your RAG search, get back a bunch of potentially unauthorized documents, and then you have to loop through all of them and do the OAuth requests for each one, and then maybe you get back none after that, and you have to go back to the database and get some more. So that's where it gets helpful to have this relational data stored across an authorization system, and relate things within a database like pgvector.
That is something that's interesting when you really think through the problem. Because, yeah, if you're just going out to the API, getting the data on the fly, and passing that into an agent, okay, maybe just passing your OAuth token along, and basically passing that permission on to the agent, is okay. But as soon as you get to the point of needing to sync data, which often happens, right? If you're building an AI email client, you're not just going to query the Gmail API every time; you're going to sync all those emails, and then you have to care about how the permissioning works in that case, especially if you have teams and sharing and you're building some kind of sales AI assistant, and you've got to have the embeddings, figure out the chunking strategy, and then figure out how you actually determine who accesses that. So yeah, I think auth gets more complicated when you're actually building real systems, not just querying data on the fly, but figuring out how you sync and store and then retrieve. Absolutely. I've talked to companies building exactly these systems, with complex event-driven pipelines for synchronizing data from third-party APIs. And they've told me that the hardest part of those data
pipelines, you would think, would be mostly oriented around doing the embeddings, like hitting the LLM API, generating embeddings, and storing that somewhere, but the thing that complicates it a lot is keeping track of the permission information. You have different bits of permission information that come in from various webhooks across the app surface, and then you have to do something really relational, really intricate, to actually apply the different pieces of information in the data pipeline, and then have a processor later on in the pipeline to resolve those permissions. Oh, so sick, by the way. Just want to let you know. I think you already knew that. But dude, the thing that's cool is,
what did Karpathy say, Shane, about layers? Was it level one, level two, level three, or layer one? What did he say about the different types of applications? You remember that? Yeah, I mean, I remember, but I don't remember enough. Level three or level two, or something like that. Anyway, the authorization problem will always be a problem no matter what software we're trying to build, right? We're doing agents now; before, we were doing microservices, where Oso's stuff became super helpful for us back in the day, which is not that long ago. It's the same kind of problem in a different place. And it's exactly what you said, Vijay: either people are not doing it because it's too hard, or they're just, who cares, yoloing it, and it has super bad performance, or they're not doing anything at all and they just don't care, so it's just unauthenticated and unauthorized, come on in. Are y'all seeing customers getting more evolved now? Are people coming to
y'all and being like, "Hey, we're finally at the place where we need to lock this down. We need your help." Yeah, we actually had multiple customers just start using Oso for locking this kind of stuff down without even telling us, and we found out about it afterwards. And, yeah, I don't know if I've seen a customer just forget to lock something down. We've definitely had customers start to think through their AI and come to us wanting to understand how to architect this stuff so that they can fit permissions into the equation, because it's super easy to go really far down a path of architecting some fancy AI thing in a prototype, proof-of-concept capacity, and then you get to production and realize the one thing that doesn't fit into your architecture at all is the permissioning story. So we have customers doing very local-authorization-based filtering, or just doing kind of inefficient API queries to us, where they get a bunch of stuff back and then do the checks against us. And then other people we've talked to, who aren't using Oso yet, have this kind of filter on what products they feel they're able to build, and when they realize what Oso can do, it unlocks new kinds of AI-based functionality. Yeah, totally. That's it.
Awesome. Man, it's been great talking to you. Anything else you want to mention before we wrap this up? Yeah. I showed a quick snippet of our MCP today in this demo, but we're hosting an MCP demo night in San Francisco tomorrow, co-hosting with Sentry, who also has an award-winning MCP server. So that's going to be fun. Anyone who's out there in San Francisco should come to the MCP demo night. I could share the link in the chat or something. Drop it in and we'll share it. I'll see you there tomorrow. Cool. Looking forward to it. All right. Thanks, both. Yeah. Thanks for coming. See you. And we'll share that link here for anyone watching. We'll drop it in there. It's kind of tough to type that in, but if you go to Luma, you will see it: MCP demo night, tomorrow. Thanks, VJ. And yeah, I'm sure we'll chat soon. Thanks. See you.
All right, dude. Oh, cool. One guest down. Yeah, Oso is pretty cool. I obviously haven't dug into it as much as I know you have, but auth is always the worst part of any application I've built; it's always just thinking through auth. It's the worst, dude. And as you get more complex, it never gets easier. So having tools helps, especially now that we need to think through it again. The problem set hasn't changed, it's still authentication and authorization, but agents add a nice little wrinkle that makes it something else you have to think about. Yeah, I would use Oso in a non-AI application as well as an AI one. So you all should check them out. Absolutely.
All right. And if you are watching this, this is AI agents hour. You might be uh watching this live. So if you are on X
or YouTube or LinkedIn, you can just drop a comment. We will show it on the screen and talk about it most likely. If you are watching this or listening to it after the fact on Spotify or Apple Podcasts, give us a review. We like the
five-star reviews. If you don't want to give us a five-star review, maybe just don't review; you know, please do that for us. But if you do want to give a five-star review, we'd appreciate any of those reviews. Yeah, no trolls. Just give us some. It helps more people find out what we're doing, we get even better guests coming on, and we're able to talk even more about building AI agents. So yeah, it also feeds our ego a little bit. You know, I like to see the five-star reviews, like, hey, what we're doing matters. Yeah. Even though all the five-star reviews we have are from ourselves, you know. Yeah. It's like, hey, all our friends, go give us a five-star review. No. Uh, yeah, we are having a good
time though, and hopefully you all are as well. But we do want to bring on our next guest. So we are bringing on Jeff from Confident AI, who is also going to be a judge at the hackathon that's going on right now. Obviously, we talked with Shria earlier about the Mastra hackathon and the templates that are getting built, and we'll learn from Jeff what Confident AI is, and maybe we'll talk a little bit about the hackathon, but more importantly about what Jeff and Confident AI are doing. So Jeff, welcome. Hey, good morning guys. Long time no see. Yeah, I haven't seen you since YC. How's it going? Yeah, it's great. You guys in the Bay Area right now? Yep. Where am I? Abhi is; I will be next week. This time next week. Hey, we're going to do an in-person show next week, Abhi. That'd be cool. Hell yeah. We're into that, dude.
Yeah. Are you still in the Bay? Yeah, we are. We're in downtown right now, actually.
Nice. Yeah. How's Mastra doing? Good. Yeah, things are going well now, so that's pretty cool. Yeah, we are here. We're still shipping, you know. We just
keep shipping. But Jeff, since most of the people watching maybe haven't heard of you or don't know what Confident AI is, can you tell us just a little bit about yourself and also about Confident AI? For sure. I'm a co-founder at Confident AI, and we do LLM evals, and
we open-sourced a package that is basically Pytest for LLM evaluation. So imagine you plug our package into your CI/CD pipeline. You write some test cases, our metrics do some validation for you, and then you can benchmark your application at any point in time. These can be chat-
bots, agents, RAG pipelines, or just the LLMs themselves. Whatever you're building, we found a way to basically integrate with it and evaluate it.
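To make that concrete, here's a rough sketch of the pattern Jeff is describing: eval cases that run as ordinary tests in CI, where a judge scores the output and the test fails below a threshold. DeepEval itself is a Python package, so this TypeScript analogue, with a hand-rolled placeholder scoring function, only illustrates the shape of the workflow, not Confident AI's actual API.

```typescript
// CI-style LLM evals: each case is just a unit test that fails when a
// metric drops below a threshold. Sketch only; the judge here is a stub.
import { test, expect } from "vitest";

interface EvalCase {
  input: string;
  actualOutput: string;
}

// Hypothetical judge. A real one would call an LLM (or a library like
// DeepEval on the Python side) to score relevancy from 0 to 1.
async function scoreRelevancy(c: EvalCase): Promise<number> {
  return c.actualOutput.trim().length > 0 ? 0.9 : 0; // placeholder heuristic
}

test("support bot answers stay relevant", async () => {
  const testCase: EvalCase = {
    input: "What if these shoes don't fit?",
    actualOutput: "We offer a 30-day full refund at no extra cost.",
  };
  // Fails CI like any normal test when the benchmark regresses.
  expect(await scoreRelevancy(testCase)).toBeGreaterThanOrEqual(0.7);
});
```

Awesome. Yeah, I know evals have been a hot topic just in general. You know, we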
talk about it almost every week, or at least every other week, whether it's news around people writing evals, how to write evals, struggling with evals, or people not running evals at all. But I am curious: you've obviously talked to a lot of people, right? Your customers, the users of the open source package. What are some
of the things that they struggle with on evals? And do you have any general tips on when people should start considering evals and how they should go about actually building them? I think evals are something for products with product-market fit. The
essence of evals really works when you have a data set that's static and doesn't change a lot. When you're still iterating on your product, for example, it's really hard to even spend the time, let alone make your evals effective. And so the first pitfall that we see very often is that people want to run these evals but they might not
have enough time dedicated to it. And that's really a blocker. So the first struggle is really that the data set isn't there, or they're not sure what the data set should look like.
And the second thing is choosing the wrong metrics, which we can talk more about later, but essentially having too many or too few metrics. For example, a startup in most cases shouldn't have safety metrics like bias or toxicity. They might have something more practical. And although it looks nice, it's best not to include them, because otherwise you muddy the data.
It becomes really hard to look at, instead of having one clear north star to iterate towards. Well, it sounds like you're kind of suggesting that... Yeah, because I know as part of DeepEval and Confident AI you have a bunch of what we like to call off-the-shelf evals, right? Things that might be interesting to add. But I believe you still
allow people to write their own evals, too, right? How do you think people should balance choosing some of the off-the-shelf evals versus spending the time to write their own? And what would you recommend there? Yeah, off-the-shelf evals are generic for a very good reason. They're fast, they're
easy to use. The way we look at an AI app right now, there are really two dimensions. The first dimension is the use case. So, for example, it might
be for legal firms, it might be for a hospital setting, chatbots, stuff like that. The second dimension is really about the architecture and the way it's built. These can be multi-turn or single-turn applications. These can be built on OpenAI's APIs, for example, or using a
framework like Mastra. The generic evals work really well for the latter, the non-use-case-specific dimension, and you really want two to three evals centered around it. For example, if you're building RAG, you might want to do some relevancy checks on the generator. You might want to do some contextual checks on the
retriever. Whereas if you're building an agent, you might want to employ some component-level evals on each step of the way. These are generic, and we recommend around two to three of them. And then it comes to the use-case
ones. For the use-case-specific ones, those are the custom evals, where you really want, for example, some custom criteria and then to evaluate against those. And it's really the combination of both that people use. There are actually more custom evals being run in production right now than generic ones.
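Here's a minimal sketch of what such a custom, use-case-specific eval can look like: the criteria are written in plain English, and an LLM judge scores the output against them. This uses Vercel's AI SDK generateObject for the structured verdict; the model choice, prompt wording, and verdict shape are all assumptions for illustration, not anything Confident AI prescribes.

```typescript
// Custom LLM-as-judge eval: plain-English criteria in, structured score out.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const verdict = z.object({
  score: z.number().min(0).max(1),
  reason: z.string(),
});

export async function customEval(criteria: string, input: string, output: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"), // assumed judge model
    schema: verdict,
    prompt:
      `Criteria: ${criteria}\n\n` +
      `User input: ${input}\nApp output: ${output}\n\n` +
      `Score from 0 to 1 how well the output meets the criteria, with a short reason.`,
  });
  return object;
}

// Non-technical teammates can contribute by editing only the criteria string:
const result = await customEval(
  "Answers must cite the relevant clause and never give legal advice.",
  "Can I break my lease early?",
  "Clause 4.2 allows early termination with 60 days written notice.",
);
console.log(result.score, result.reason);
```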
Good. Off-the-shelf evals, in my opinion, don't really do anything for you other than the second problem space, which doesn't really matter if you're still chasing PMF. For example, in the law firm case, maybe bias does matter, but it's not going to be the off-the-shelf bias metric. It's going to
be the one you wrote about how your company deals with bias. For sure. Yeah. Right. So yeah, people also just like writing their own evals a lot more.
But yeah, totally true. I think off-the-shelf evals are less and less useful, and people are moving more into custom evals, how to customize them, and so on.
Yeah. I mean, I think the idea behind off-the-shelf evals is a good one, right? People want to get
started quickly. They want to be able to just see some data, so maybe there's some usefulness in understanding how evals work, and maybe there are one or two that make sense for you. But I do think the real value, the business value, is in the custom evals, right? Spending a little time figuring out the measure you care about the most. How do you
measure that? And then how do you track it over time? I think those are much more important.
Yeah, for sure. And like I was going to say, they just correlate with business ROI a lot more, so it's also much more convincing for even the non-technical folks on the team to get started with. They can also contribute, since they can write the criteria in everyday English, and that's a big plus as well.
Do you see a chicken-and-egg situation with users coming from an industry that already existed, like the law example: there's a law firm, they understand how their business works, it's been running for 20 years, but they want
to have an agent, so they build one somehow. Under your criteria they don't have any data sets yet, but they could probably still eval it, right? They don't necessarily need a data set. They could take any new production trace that comes
in and start evaling it. So what's the right time to actually say, "Oh, I have a data set now. Let me sit down and do this?" Or are you just always trying to think about
doing it as the train is running? I think there are really two approaches, and that's a great question. The first scenario is when they already have the data set; maybe they have a dedicated team working on it, like the QA teams at some bigger companies, for example, and those are really easy to
get started with. For folks that are still trying to curate their data sets, we offer a tool to monitor their AI apps, like tracing in production. Then we can enable evals for all the different steps, intelligently select the points that we think they should add to their data set, and they can just approve at the end. That's what we recommend
right now for folks that might not have as much time as we'd hope to sit down and curate these data sets. It might not be as robust as dedicating someone full-time to working on this, but it's definitely a really great method just to get started with
something, and then once you have the data set in, you can work with it as you go along. Yeah. Sorry, this is my last question. I feel like sometimes
evals are an enterprise initiative rather than actually for improving your AI. Do you get companies that come up to you and say, "Hey, I need like 10 off-the-shelf evals so I can check this box for my AI initiative"? We certainly do get those.
They're just not really great customers, you know what I mean? Yeah, dude. Yeah, we know the type as well. Yeah,
we see them in the first five minutes. You can tell. Yeah, for sure. Well, Jeff, we
appreciate you coming on. And so, you are judging the eval category of the mastra.build templates hackathon. So,
definitely excited to see what you think of some of the people that are building with evals. I know I talked to at least one person who's building a template that's trying to show off how evals work. So definitely looking for your takes there. Anything else? How can people
connect with you or follow what you all are doing? Well, let me link our repo. I think that's the best way to follow what we're doing. Everything we're shipping is there. Do
you know where I should paste it? Drop it in the private chat. Yep, I got it. So, I'll drop it
in. If you're watching this on YouTube live, you'll see it. Otherwise, look at the screen right now. So, it's DeepEval by Confident AI. Go give them a
star. Give them some love. Yeah, give it a star. It's great.
All right, Jeff. Well, appreciate you coming on, and yeah, I'm sure we'll talk to you soon. Yeah, good to see you, man. Thanks.
Yeah, see you. Thanks. All right, dude. Evals, it's a
thing a lot of people are talking about. Yeah, they're living and breathing it, too, which is dope. Yeah. Well, the show must go on.
Our next guest is already here. We're just rapid-firing guests today, which is great. So, this one is someone who's been on before, but slightly different this time around. He's wearing a new jersey this time. Yeah, he's definitely wearing a different jersey. I
like that. All right, Annie, welcome. Hey, what's up everyone? Thanks for
having me again. Yeah, so, people who have been following us for a while have probably seen your face before. We actually just talked about Stagehand earlier when we were showing our browser agent template, and it uses Stagehand under the hood. So maybe can you give everyone a little background?
Yeah, sure. So, yeah, I was at Browserbase previously. We built Stagehand, and the goal of that was, you know, Browserbase is going to be the browser infrastructure, and we just wanted
to provide tools for agents that can control a browser, right? Our whole hypothesis was that someone's going to build an agent that looks at a screenshot and determines, oh, in order to order a pizza, I must click this button. And that agent shouldn't have to worry about writing the right Playwright code and choosing the right DOM selector. It should just offload that to
a different tool selector or tool executor. And Stagehand was that tool executor.
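The split Annie is describing looks roughly like this: the planning agent only ever speaks in natural-language intents, and an executor translates those into concrete browser actions. Stagehand is the real version of this; the act function below is a stand-in with a hard-coded lookup table rather than Stagehand's actual LLM-backed API, just to show where the boundary sits.

```typescript
// Agent/executor split: the agent says *what* to do, the executor decides *how*.
import { chromium, type Page } from "playwright";

// Hypothetical executor. A real one (like Stagehand) would use an LLM to map
// the instruction onto the live DOM instead of this lookup table.
async function act(page: Page, instruction: string): Promise<void> {
  const selectors: Record<string, string> = {
    "click the order button": "button#order",
  };
  const selector = selectors[instruction];
  if (!selector) throw new Error(`Don't know how to: ${instruction}`);
  await page.click(selector);
}

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
// The planning agent never sees selectors or Playwright, only this intent:
await act(page, "click the order button");
await browser.close();
```

And so yeah, when MCP came out, it was such a hand-in-glove fit, right? Because we were building tools for agents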
and MCP was, here's how you expose tools to agents. So yeah, we were pretty early in the MCP space, and then frankly, over time, people started complaining about our MCP server. They were like, oh, it doesn't work in these situations, you know, I tried it in Claude and it works fine
but in Cursor it doesn't. And I was like, damn, this is cool feedback, but the only KPI we have right now is GitHub stars, which means I don't know how people are actually using it, what they're using it for, or how valuable this feedback is in the context of a product
roadmap. And so I realized, okay, MCP is pretty useful, but we're definitely not the only ones facing this issue, and that's what really compelled me to build Smithery. MCP observability without a platform to make MCP itself bigger wouldn't stand alone on its own, so we wanted to build an MCP
platform with observability as a first-class citizen. So here we are. We should have the TBPN trade board thing that they have: traded. Well, not really traded. You're a co-founder, right? Yeah. Not traded, but, you know, moved on. Yeah. It's like I made my own team. Yeah.
It's like you helped start a new franchise. Yeah. New franchise. Tell us the story
of how you and Henry linked up. Yeah. Like I mentioned, we were pretty active in the MCP space. There were a couple of events that we'd go to for MCP stuff; we'd both be speaking
at the same events, and then we just kept trading notes. Also, Browserbase was an early design partner. I think the way I first met Henry was just a couple weeks after MCP came out. Henry had built Smithery and he was like, "Hey, deploy the Browserbase MCP server on Smithery." And we were like, "Okay, you built a marketplace for the 20
MCP servers out there. Like, sure, right?" I remember even you guys at the time, you had the MCP registry registry, which was kind of a gag. And so, you know, we were like,
"Sure, fine, list on Smithery." And then quickly, as we were running blind and didn't have any KPIs besides GitHub stars, our KPI became usage on Smithery, which is a really interesting paradigm shift, right? It was only one number, and that number could only go up, because it was just usage.
So yeah, that's how we met. Henry and I kept trading notes. We were early design partners, and Browserbase is a big enough logo to help Smithery in that direction too. So, just trading notes, and then we realized there's probably quite a bit of overlap here, and it makes sense for me to help set the direction of Smithery itself.
That's awesome, dude. Yeah. How long have you been there now? A couple months.
Henry and I started talking around May, I think, and I officially made the leap in July. Nice. So what's new for Smithery, now that it has more
firepower? What are you all going to take over the world with? Okay, so fundamentally, I was talking to someone at DeepMind a couple weeks ago, and he was of the opinion that SaaS itself is not worth building. He was basically saying, "Oh, you're just going to
tell an agent, build me Salesforce, and then this big huge model is going to say, okay, I'm going to build you the front end, the back end, everything. You don't need any other services. The model can just do everything for you on this remote cloud instance." And I was like, damn, that's really
scary. Like, we're all just doomed. But what saved me from that paralysis was this: I'm a software engineer, and three years ago, before AI and everything, if I had to set up a database, I wouldn't just go straight to setting up an EC2 instance and putting
Postgres on that EC2 instance. I could do that, but at what cost, and why? Even as a human who's capable of doing these things, it would just make my life so much easier to offload that to a Supabase or a Neon. And so I think the future we're headed towards is one in which you could give Claude
Code or some general agent something like, here's 20 bucks, here's 30 bucks, go build me Salesforce. And that agent will offload: okay, I need a front end, a back end, and infrastructure. And it's going to say, okay, front-end agent, here's 10 bucks, go build out the front end. Backend agent, here's 10
bucks. Infrastructure agent, here's 10 bucks. And then these specialized agents will take the task of building out a front end, and they'll say, okay, do we use Vercel or Cloudflare? And from there it's going to offload the actual building task to, like, a Next.js agent, right? So, all this being said, I
think there's a huge gap in the orchestration of all these generalized-to-specialized agents. It's very akin to regular software engineering, where all of SaaS is just this service-oriented architecture where you hand off API requests to more specialized services. And I think
the same could be applied to AI itself. I think we're just trying to help build this new AI service ecosystem. Nice. So Smithery
would have a bunch of MCP servers that people could offload certain things to, dynamically maybe, or whatever. That makes a lot of sense. So we have the concept of a profile. You could have your work
profile, your personal profile, and all of these have managed connections and everything for you. That way you don't have to log into Notion on 10 different clients. You can just save all your config in Smithery and then log into Smithery on your different clients. Yeah, that's super cool. I saw
you guys just released OAuth MCP support or something like that. Is that true? Yeah. So now all of our
MCP servers are fully OAuth compatible. Sick. Which is super nice, because earlier you had to copy-paste API keys, which you still kind of have to do. I think the biggest problem with MCP is that there is one spec and
nobody actually follows the full spec. So you run into the problem WhatsApp did, where they had to support the most legacy versions of Android because they were trying to expand accessibility to messaging in general. That's kind of our problem to solve as well. We
want to work on any MCP client, even Claude Desktop, which surprisingly has kind of the worst support for MCP. Yeah. Why is that? We've noticed the same thing. It's like, you created
MCP. You should have the best support. There should be day-zero support in Claude Desktop for everything. But
that is not the case. They're obviously different teams. Yeah, it's pretty insane. And I think, you know, we're lucky enough to be on
the MCP steering committee as well. And I think when you have something very community-driven, you end up kind of like the DMV, where there's no reason to be efficient. I think we're trying to be that neutral layer across the hyperscalers, across the inference providers, kind of just being the
company that brings efficiency to MCP and helps people actually use it a lot more easily. We actually have, shipping this week, observability in beta. Let me show you, actually. Actually, can
I share my screen? Yeah, of course. Super sneak peek. I'm literally vibe coding this right now.
We always love live demos. Live vibe demos. Let's check this out. Cool. So this is, as you can see, very bare bones. But what we
have here is a leaderboard of what clients people are using over time, right? So, I mentioned we're super neutral across all the clients and all the hyperscalers, and you can see here how many people are using Cursor, how many people are using Claude Code or Windsurf. You can
see the biggest usage here. Trae actually has surprising usage, even over Windsurf. I didn't really realize that. I think we have a lot of people in East Asia using Trae. And there was another surprise: there are a lot of MCP clients here that aren't even English-language.
I was kind of shocked by that, right? What's also notable, though not the end-all be-all, is that Cursor is still far and away ahead of Claude Code. So, you know, it's not the end of Cursor or anything. This is just the distinct users that have used an MCP server, or at least started a session that has an MCP server, in the last 24 hours. And
then we also have, if you are a server developer, I'll show you guys our Notion MCP. This is fully open source. Shoot, I need to pull changes from main. Anyway, point being, we now have better observability, so you can
basically see who's using your server over the last 24 hours and what they do. How do I stop sharing my screen?
There should be a stop button somewhere. Let's try. I got it for you. There we go. Thank you so much. Yeah, the point being, we now surface all the
stuff that I was complaining about: I don't know how people are using my server, how many people are using my server, what clients they're using it on, where it's breaking, stuff like that. We're starting to actually surface that back to developers, to hold the big players accountable for building good software. When you're
building a marketplace like this, are people asking y'all, what are the blessed tools? Which are the blessed servers? Like, I only want to use the one that Henry said was okay. Is that happening? 100%. It's pretty
interesting, too, because we're a fully open marketplace, so we have like seven different implementations of the Notion MCP. And what's weird is that the original Notion MCP server wasn't that good, which is why we actually built our own Smithery Notion MCP, which we found works better.
But yeah, it's interesting, right? Because with the open marketplace, you also get to hold the big players accountable. You get to say, hey, the MCP server that you built kind of sucks. And I think a lot of people just built MCP servers to have an MCP server.
So they don't need to make it good. And I think that's what we're trying to do: incentivize more open source MCP servers, such that if the big players have an MCP server that doesn't work, you could just update the open source one and hold them accountable. But
yeah, which is really interesting because, you know, you can't build an API for Notion, right? You can't do that. But you can build a good MCP server that uses their API, right? I think
at one point APIs were things people just built to have built them, right? And they were not very well-maintained. But over time, people wanted to use them, so they had to make them better. Now that most places have relatively good APIs, it kind of opens the door for
anyone to build an MCP server using that API. So whether you're building it just for yourself, or building it to share with others because you think your implementation, your tool descriptions, your collection of tools built on the API are a better fit, you can just build and release your own. Yeah. But this is the problem. I've seen this history repeat itself for like
the five-thousandth time. We've seen this ourselves in the tragedy of the plug-in ecosystem. When technology proliferates, people want to do the same thing: you got a Notion, I got a Notion, everyone's got a Notion. And we're humans, so
we're not always consistent in maintaining these things that we put out in the world. They degrade. Then, because the technology is so popular, users pick the wrong Notion server and hit production issues later on. And it always goes back to the authors. They're like, "Hey, how come you didn't have..." We had this one back
in the day: Gatsby had two Shopify plugins, two WordPress plugins, and they were always in contention because the authors of the other one were like, "Hey, you guys move too slow on supporting Shopify." It's like, "Yeah, dude. We're not Shopify. We're building a whole plug-in
ecosystem. Obviously, we're not focused on just this." Anyway, sorry, I digress. It just always goes back to this:
the people who own the marketplace have to have some type of judgment. This is good, this is not good. Because people want to blame you when it goes wrong; everyone needs someone to blame. So, I think y'all should get
ahead of it and just start christening things from the beginning, because they're already going to blame you at the end. Might as well take a new strategy and say, "These are the sick MCPs, and it could be all yours." That's actually kind of what we have, to some degree. Like I mentioned, we forked the Notion MCP server. It's just tricky with
established brands. Because it's like, yeah, I don't want to go one-v-one with Notion. But, and also to be fair, it's not our responsibility to make every existing service MCP-friendly. Take ChatGPT, for example:
when it first came out, there was no ChatGPT app, and there were like a hundred different ChatGPT apps in the App Store. But nobody used those apps, because everyone was like, I'm just going to wait for the official one to come out and be good, right? So I think what we want to do in the meantime is make the existing MCP servers that are out there really good, and
incentivize people to actually build good MCP servers, such that when companies build first-party MCP servers, they're up to the standard. What's interesting is that the risk to Notion is not someone building a better third-party Notion MCP server. The risk is more that
there is going to be a more AI-native note-taking platform that is just better for people to use and has better reach, kind of like what Linear did to Jira. If someone builds a better AI-native task management platform, that could basically overtake the Linear MCP server. Yeah. So I think we want to facilitate this kind of AI-native marketplace and help people build more AI-
native apps, right? So yeah, I fully expect that over time, it's just natural: it kind of expands and then it condenses, right? It starts small, it expands, and then it'll condense around the good MCP servers. And hopefully those good MCP servers are built by the parties themselves, right? So there's first-party support and they're maintained over time. That's not always going to be the case. Sometimes the third-party
MCP server is just going to be better, because they actually care and the first party doesn't care about their MCP server. And maybe that means the first party loses long term as a product, or maybe not. Time will tell. But I guess you're kind of there for the ride, right? You're going to provide the ability for all these servers to get there. You're going to provide the
observability for people to determine what's used and what's best, and maybe ratings, I don't know if you have ratings or whatever. And then over time, the market will consolidate on the winners. I don't think you can necessarily fight that. All you can do is provide the tools to help people make those decisions, but
also help others, if they think theirs is better, get theirs out there. And I think that's the challenge but also the opportunity of a marketplace. I think what's also really cool about new marketplaces and new platforms in general is that we don't know the use cases yet, right? It's just this brand new platform. A lot of people aren't using it yet. I
think about when the iPad came out, right? That's probably one of the first major technology shifts that I was really alive enough to remember. And I remember when it came out, people were like, "This isn't a phone. This isn't a laptop." Kind of like, what is it, right?
Why would I buy this? I don't think in 2010 people could have foreseen that every POS system everywhere would be an iPad. I think it's the last use case people imagined, but it's the most common. Or it's basically the dash in your car in many cases too, right? And there are so many use cases that we can't even fathom. Like
the other example I like to draw on: when the App Store itself came out, you could make the argument, why would you need an app when you could have a mobile-friendly website? There was Cordova, there was PhoneGap, where you could just port your website into this iframe rendered as an app. And then Instagram
came out and they were like, you know, we're an app. We don't even have a website. And look what happened to that, right? TikTok, Vine, all these things are just platform-native and they don't really need an analog. And I think
that's kind of what I want to enable. And I have no idea what those actually are yet. I think we're still in the age of, you know, when the iPhone first came out and there were fart apps. That's kind of where we are. It would be wild to see a product
whose whole code base is just the MCP server. Yeah, I was just going to say that. So, Annie, your prediction is that there are going to be MCP servers that maybe don't even have websites, or maybe the website just points people to the MCP server. It's just a landing page that says,
"Hey, we're an MCP server. Here's how you get it. Go here." This actually literally happened a
couple weeks ago. There was this product that launched on Product Hunt. They made it to top 10 and it was called Seek Easy. uh s e k e- s y and uh like some
Google engineers. Really cool. It was just like an AI enabled Yelp. So you could say, you know, like hey, what are
some good restaurants in like the Flat Iron District of New York? And it would respond with like these influencers in New York think that these restaurants are really good. And it was just an MTP server as a service. And exactly what you're saying is exactly what happened. They didn't
post their landing page on Product Hunt. They posted a link to the Smithery server page where people can see like, oh, what tools are accessible? you know, like how do I kind of interact with the server? I thought that's like exactly what we're trying to enable. Like seek easy is not
an API. It's not an existing app or website. I think they have an app now, but it was just fully MCP server first, AI native first and then see what sticks. Nice. So, yeah, just Yeah, exactly right. Like
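For anyone wondering what an "MCP server as the whole product" even looks like, here's a minimal sketch using the TypeScript MCP SDK. The server name, tool, and response are made up for illustration (this is not Seekeasy's actual code), and the SDK surface shown is the documented shape as of recent versions; check the @modelcontextprotocol/sdk docs, since details drift between releases.

```typescript
// A product that is nothing but an MCP server: one tool, stdio transport.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "restaurant-recs", version: "0.1.0" });

// The entire "app" is this tool; any MCP client (Claude, Cursor, etc.)
// becomes the UI.
server.tool(
  "find_restaurants",
  { neighborhood: z.string().describe("e.g. 'Flatiron District, New York'") },
  async ({ neighborhood }) => ({
    content: [
      { type: "text", text: `Top picks near ${neighborhood}: (results here)` },
    ],
  }),
);

await server.connect(new StdioServerTransport());
```

Nice. So, yeah, exactly right. Like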
I wonder what more is going to come out that's like the next Instagram or the next generation-defining app. Well, good luck building it. That'd be crazy.
Yeah. Come back on, you know, give us updates along the way. Definitely if you've got big launches, things
that you're doing for our audience. And, you know, selfishly, Abby and myself, we're obviously very interested in MCP. We do a lot with MCP, and we're very excited, too, that you're sponsoring the hackathon that's going on now. We talked about it earlier in the show. You are giving away a Nintendo Switch for
the best use of MCP. Are you the one that's judging? You're the judge, right?
Yeah. All right. So, you want to see the best usage of Smithery. Go get that Switch. For those of
you listening, you can win yourself a Nintendo Switch. That would be pretty sick. Hell yeah. Well, can't wait, guys. Thanks so much for having me on today, coming and hanging out with us. Like,
I'm glad you're at another company that we really like. So, yeah. You just went from one
friend to another friend. All right. Hell yeah. Thanks, you guys, for having me.
Thanks for having me on your website as well. Every time I go to mastra.ai, that tweet's there for me.
All right. That's what we do. Cool. All right. See you
guys. Dude, the last two episodes have been wild, man. Everyone's cool and working on crazy things. It's dope.
Yeah, we're obviously getting more and more guests that want to be on, which is great. So, if you are watching this and you want to be on, if you're doing something interesting in AI, come tell us about it. We like having interesting guests that are doing interesting things. Yeah. The more interesting people we know, the
better. Yeah. Let's do a recap, I guess, for everyone before we get into the news: what we talked about today. I'll try it. I
was gonna say, you want to do it? Let's do it. Hit me with it. All right. So today we
started the show with a hackathon update. We had Shreda come on. We talked about what the hackathon prizes are, who's judging, what we're looking for and what we want to see. We looked at
some code from the browser agent, which is using Stagehand, which kind of connected the whole episode together, which is cool. Then we talked to Oso about authorization, especially in RAG-based systems like they were showing. We're also users of Oso. That was cool. Then Confident AI, our friend Jeff from YC.
He's also a judge for the hackathon. We talked about DeepEval and different eval things. I think anyone who's doing evals should go listen to that segment. Thought it was pretty interesting. Then we had the
homies from Smithery. That was an interesting-ass conversation. I don't even know how we talked about all that, but it was just flowing. Honestly, just go watch that segment again too. I
can't even repeat it. It's just interesting stuff. Yeah. Cool to see all the observability coming into MCP
marketplaces, and thinking about what MCP as a form factor could be. Yeah. Obviously, I think we're still figuring it out, and
that's what's kind of exciting. Which reminds me, I want to write a blog post about the tragedy of plug-in systems. I'm going to write a note for myself to do that at some point. Yeah. I mean, we are going to, at
some level, be facing that again with templates, right? Exactly. So it's one of those things. It's a necessary evil that you
have to go through if you want to provide value. Because I do think those systems, a plug-in system or an MCP marketplace or a template directory, provide value to people. But they're challenging to regulate: making sure the best ones show up and that people don't blame you when the bad ones don't work, which they will, even
if they're contributed by someone else. So, it is a challenge, but I do think a worthwhile challenge if you have users that can get benefits from it. Yeah, 100%. I'm excited to see how this
goes. Oh, we got a little comment. Justin, nice. Familiar name. "Evals for
sure important. Just had a real example on my hackathon v2 template. Thought I made a great update. Nope."
All right. Usually how it goes, dude. Yeah. Well, good to see Justin's working
on something for this hackathon. Yeah, I'm excited for Friday, to start seeing all the submissions come in. And then next week the plan is we're going to talk about it. On next week's show we're going to announce and showcase
all the winners from the hackathon. We will talk about who got what prizes, we will showcase them, and we'll just talk through what the next steps are for templates, for the hackathon results, all that. Well, dude, should we do some news?
It's time for the news. We gotta keep on going. Yeah, you know, I thought it was going to be kind of a low news day, and there's nothing too major that came out over the week, but there's some really good stuff to talk about. The first thing I will
share, I thought was just kind of funny: a casual observation from Cloudflare. This tweet caught a lot of people's attention: Perplexity is repeatedly modifying their user agent and changing
IPs and ASNs to hide their crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites. Dang, shots fired. They're calling it out. They've got a whole blog post on it, just throwing Perplexity
under the bus. And, you know, it's the drama. There's always some kind of drama going on every week, and this seems to be it. I always find it interesting that Cloudflare is always in these dramas. Just want to throw that out there. No
hate on anybody though. Yeah. No shade thrown, but Cloudflare, you're always in it,
you know. Quacks like a duck. That's a good one though. You know, the thing is, Aravind probably doesn't give a... I shouldn't say like I
know him. Aravind Srinivas probably doesn't give a damn. That guy's on a mission. Yeah. I don't feel like we know him, but we did get to see him. So, I feel like we know him a little bit.
I get that vibe, right? He definitely gave off that vibe, like, "Yeah, all right." Dude, I remember during YC when he talked to us, he was like, "Oh yeah, I didn't get into YC, but you know, it all worked out anyway."
Kind of threw YC under the bus, too. Yeah. You know, he doesn't care. Seems like
no Fs given from him. At least it seems that way. So, we will see. I doubt there's any kind of official response or anything, but I thought that was an interesting tidbit of information. Do with it what you want, but Cloudflare's
in it. Cloudflare finds a way to make it into the news, though. So maybe that's their plan: they're going to stir up some controversy and then they stay top of mind.
So, the next thing we want to talk about: AI SDK v5 finally launched after we've been talking about it for what feels like months. We've been previewing all the things that were coming with it, sharing a bunch of tweets, but it officially launched last week sometime. I don't remember the exact day, but it's here. It's
live. Yeah, it was like last Thursday or something. Yeah, that sounds right. Congrats to the AI SDK team. Like,
good for y'all. Yeah, good things about v5 over v4: there's more structure to the data stream protocol, which will allow you to write better, more feature-rich UIs from your agents. You can store metadata
on these messages and stuff like that. Outside of that, everything is relatively the same; there are just new message formats and new data structures that didn't exist before.
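Roughly, the v5 change looks like this on the wire: a UI message is now a list of typed parts plus optional app-level metadata, instead of a flat content string. Field names below follow the v5 docs as we understand them; treat this as illustrative, not a spec.

```typescript
// Sketch of a v5-style UI message: typed parts instead of one content string.
import type { UIMessage } from "ai";

const message: UIMessage = {
  id: "msg_1",
  role: "assistant",
  parts: [
    // v4 had a single `content` string; v5 splits output into typed parts,
    // which is what makes richer, structured UIs possible.
    { type: "text", text: "Here's the weather:" },
    // Tool calls, reasoning, files, etc. each get their own part types.
  ],
  // New in v5: arbitrary metadata can ride along on the message.
  metadata: { model: "gpt-4o", durationMs: 812 },
};
```

It definitely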
makes our lives a little bit more complicated. But yeah, that's cool. In terms of our v5 support, we have v5 support in a snapshot branch; snapshot branches are like beta branches that we give out. I would say that branch is a breaking change,
embracing v5 as a breaking change, and you're more than welcome to start using it that way. We're currently working on a different strategy so we don't have to break everyone who wants to use v5 versus v4, an alternative path we're looking at, so it should be pretty seamless to use wherever you're at: v4,
v5, v6, whatever's going on. So yeah, stay tuned on our part. I know it just came out, but some patience will be required, because we have some other things cooking that are along the same lines. Yeah. Soon, but not imminent. Not today or tomorrow, but
in the matter of, I would say, a week or two. Yeah, like a week or two. So, a little patience, but our goal is to make your lives easier. So if
you want to use it, you don't have to worry about doing a big breaking change. We're trying to take a little bit of caution and just make sure we do it right. Yeah. Also, the thing for us that I
want to point out is that we have a surface area of all our users, as well as integrations into Mastra, that are expecting the v4 format, and that is also a coordination effort from our perspective, because we are friends with a bunch of different libraries. We're not going to just break on them and say, "Hey, good luck, bro." So we want to do something that allows their
stuff to continue working. Like, we just had a CopilotKit thing in person last week. We have one next week.
Yeah, we have a workshop with CopilotKit on Wednesday. So check that out. Imagine if we broke it and then had to get everything ready for the workshop. No thank you. That's enough stress as it is.
So, yeah. More to come. We want to make sure we
are able to keep working for everybody. I have heard, though, and I don't have any AI SDK-only projects myself, that the upgrade path is relatively easy. It hasn't been too bad for people; a few people mentioned some minor type issues or things they ran into,
but overall I've heard it's a relatively easy upgrade path. So if you are using it, and for some reason you're not using Mastra, it should be a pretty easy upgrade. But definitely read the docs, and your mileage of course is going to vary, but it does seem to be pretty simple.
It'll be not so good if you're using, they used to have this thing called chat store v1 or message store, which they broke during v4. If you're still on that, you're going to have a tough time, because you have to migrate a database and all that. Actually, their recommendation was to delete everything and start over. I
don't know if that's going to work for you. Probably not. Which is why you should use something like Mastra, because then it's our problem, not yours. So,
yeah. Yeah, we provide the more seamless upgrade. So, we do have a comment. Jelp says,
hey, I realize the hackathon finishes this Friday and I didn't start anything. Well, the good news is you have until Friday. You got time. There's still time. Get on it. Come
into our Discord. Let us know what you're building. And if you have questions, I don't think we mentioned this earlier, which we should have, but we do have office hours on Discord on Wednesday if you want to come hang out with me and Shreda and chat about what you're building, get help, ask questions, just hang out. The
time is, I'd better make sure I know, 10:30 a.m. Pacific on Wednesday, so
come hang out. All right. Next up: of course, it seems like we can't go a week without talking about Claude Code. So, we're gonna talk about Claude Code.
Last week, they did announce they are adding some weekly rate limits if you're using the Max plan. I jokingly blamed Tyler on our team, because he said he would have spent $2,000 in one week but only had to pay the $200 a month. And then the next day, Anthropic rolls out these new limits. So, he will
hit those limits for sure. But joking aside, they said it only impacts about 5% of users. So it doesn't affect most users on the Max plan, but it sounds like some people were really just letting Claude Code run wild,
and it was costing them a lot of money. So, dude, what are y'all doing with that spend? What are you actually doing? Are you inventing the future or something, spending $2,000 or whatever? Dude, Opus is expensive. I told you I
spent $100 in a weekend on one session, you know? Yeah, but you're just doing some minor tasks, right? Do those cost $1,000? A hundred bucks is fine, let's just say. But what are you doing spending thousands of dollars? Like,
what could that task be? What if it's solving AGI? The funny thing is, it would tell you it did it: "Oh, yes. Here it is." And then you say, "Oh,
you did this wrong." It says, "You're right. I did do it wrong. I'm terribly sorry.
Here, let's do this other approach." Let's spend another thousand dollars.
"You're absolutely right. I did it wrong."
Yeah. I don't know. I do think Opus is very expensive. So I just use an
API key, because I didn't think I was going to spend $200 a month. I probably should just get the Max plan. I'm just using Opus, and obviously it's significantly more expensive. But my session over the weekend wasn't intense, and I still spent $100. I could have easily spent two or three hundred if I was
really doing some work, but I wasn't heads down the entire weekend. You know what I mean? It didn't take that long to spend $100. So, I don't know. I think Claude's
pretty expensive, especially compared to other models. It's obviously, I would say, slightly better for coding, and slightly better is kind of a big deal if you're writing code. So, I think that's why everyone uses Claude, but I do think that over time those other models are going to continue to catch up. So, I don't know. Anthropic needs to fix some pricing issues. I think it's still a little too
expensive. Or maybe they got us hooked on the drugs, and now they can set the limit. Yeah. You know, you're hooked.
And also, I hate that you can't get the same kind of flat rate with just an API key. You're selling a Max plan, but then what about all the money I have in my API credits, my API bank or whatever? I have to have a separate plan now. It's annoying. I mean, it's how you make money though. So, hey.
Yeah. So, yeah. Marcelo says, "Kimi K2 is too good."
Haven't tried it, but I've heard good things about it as a coding model. We're actually going to talk about some of that stuff a little bit later. I got a couple other comments. Justin says,
and we're just replaying a conversation that's happening in real time on YouTube, so if you are watching or listening to this recording later, you've got to be here. But Justin says: Jelp, templates from what I understand don't have to be crazy. Just an example of how to properly use Mastra. Maybe think
utility use case instead of massive agent network. Agreed. Agreed.
Jelp says: true. I might just create a simple schema creator or something from my previous Mastra project. I ran out of OpenAI credits. Lol. I was going to submit that, but it's way
more complex. Well, Jelp, you should submit something; we'll take whatever. And then Justin says, last one
and then I will stop: Yeah, my main reason is just to help others have their aha moment, and really even a simple use case will help someone. Think about what would help you jumpstart. Okay. So I
just recapped a conversation happening on YouTube, which you can go check out yourself. You should be there. But, community. Yeah. We do appreciate you having
this conversation on the live show, because yeah, we're in this together, and templates are exactly that. We do appreciate complex use cases, but it doesn't have to be complex. A simple template that helps someone get started is a really good thing, because it allows more people to build agents, build workflows, and get started quickly with Mastra. All right. So we have some more Claude
Code. We're not done talking about Claude Code here. We never are.
So in this case, let's see this one. Claude Code can now work across multiple directories in a single session. You can add directory paths to add working directories. Before, it
was limited to just one directory. So you can have it go across a front end or a back end or a shared directory, as you can see in this video. This was released kind of last week, on the 29th. I haven't used this yet, but I do use Claude Code quite a
bit. So I imagine this is something I will use at some point, especially in monorepos, to specify specific directories to work across. So that's cool. Anything on that? Nothing. No, but it
made me think that we should invite Matt back on, because he just built an app without looking at the code, and he's a huge Claude Code user. Matt B., by the way, Matt B. from Netlify. Yeah, he's a Claude Code maximalist, I guess.
And he makes an appearance in our stream starter now. Did you see the new stream video that we've got at the beginning of the stream? Yeah, he's in there. You know, he's been a guest. He's a friend
of the live stream. All right. So, one other thing: if you want to be a Claude Code maximalist, there's a new video from four days ago from Anthropic, Claude Code best practices. I watched a little bit of it.
About 10 minutes in is when the meat really starts, so you don't have to watch the whole thing. Go to the Anthropic YouTube channel and look for Claude Code best practices. If you are trying to figure out how to get the best out of Claude Code, Anthropic, maybe not the team behind Claude Desktop,
but the rest of Anthropic, can tell you. That's an MCP joke. It's a bad one, but I think it's funny. Claude Code best practices; learn how to get the best out of Claude Code. All right, next up, let's talk about the AI Engineer World's Fair. You know, is it a conference, a world's fair? I don't know. Conference,
whatever. An AI engineer conference. Yeah. Which one though?
Yes. So, first we're going to talk about the last one, because the author of this book, our co-founder Sam, gave a talk at the last AI Engineer conference in San Francisco, and that talk has finally been released. You should go check it out: Agents versus Workflows: Why Not Both?, with Sam
from Mastra, our co-founder. It's got 10,000 views in the last three days. Let's make it 20,000. Let's go out and
watch it. Let's learn. So, wanted to talk about that. I've seen
a few other talks that they've also released on YouTube. So, if you're looking to learn more and you weren't able to go to that conference, go check out the AI Engineer YouTube channel. You can probably find some good talks and learn some things. I'd recommend this one, of course, but there are others out there.
Yeah, there's a workshop referenced in the comments there. AI Engineer was like, hey, he mentioned a Mastra workshop, and they linked it, so you can watch another Mastra video that happened, one we didn't even know about at all until we were at the conference, which is cool. And next up, continuing on AI Engineer, there's
another conference coming up, and Obby, you're going to be there, right? I think I'm going to be there. Are you going to be there? I think I'm going to be there. And, you know, Sam's going
to be there. A couple other people from the Mastra team are going to be there. So if you are in the area, you should come say hi. We will be around. Yeah,
we'll give you a book. Come find us. And yeah, we will be there en masse, you know. We'll have people there with us. So, please come say hello. It's going to be a
cool time. I'm going to learn a lot. I'm sure you will learn a lot. Going to be great. Yeah, that's it. That's AI
Engineer. That's that topic. All right, we got two more big news topics to talk about today.
This one, I can't say that I've really kept up with it a ton, other than I just can't stop seeing mentions of Chinese LLMs. Seems like they're literally everywhere, especially open models. So,
a few updates around some of the Chinese models that are coming out that are open models. The first is from Z.ai: GLM-4.5, with reasoning, coding, and agentic abilities.
So here, this is just from the Z.ai blog post: they introduce two models, 4.5 and 4.5 Air. They're kind of their flagship models. They have a comparison of performance on agentic, reasoning, and
coding benchmarks. They ran it across 12 benchmarks here, and you can see where it performs. So
as a model it performs pretty well, right? You know, Grok 4 just came out and was kind of state-of-the-art in some things, and they're saying o3 is still state-of-the-art here by these specific benchmarks.
And it's obviously pretty high-performing, especially compared to some of the other models that are out there. And then it has specific breakdowns on agentic benchmarks, reasoning benchmarks, and coding benchmarks. And you can see, as expected, Claude 4 Opus is still the
best in coding, but some of these other ones are getting pretty close. Yeah. I mean, Grok is still killing it.
Yeah. But the fact that they have a model that can perform like this is great. Yeah. I don't know if it's great for us, but it's great in general. I think it's great for people building with AI
in general. So, yeah, I don't know all the details, other than obviously just reading through the basics of this post, but there are many good models that have been coming out. And this was July; it seemed like July was just Chinese model
month. Yeah, dude. It was. I think every show we were talking about a new
model. I'm going to be honest, and we do this stuff every day: I can't keep up with all the different companies behind all these models. So, it's been pretty impressive to see. They're on kind of a tear.
So, somewhat related, this was just kind of an interesting blog post: Cerebras Code has launched. There are two new plans to try to make AI coding faster and more accessible:
Code Pro for $50 a month, and Code Max. They give you Qwen3 Coder, which was at the time the world's leading open-weight coding model. As of August 1st they still say it is, and it runs at 2,000 tokens per second, which is really fast. And no limits.
So it's probably one of the fastest coding models. Damn. And you can see here how they're positioning it; they all have a similar graphic, right, on SWE-bench Verified. Yep. So you can see how, on model size plus
performance, it's pretty good, right? I know Tyler on our team was pretty excited about it. I don't think he's tried it yet, but if you want to get around the Claude Code limits, maybe you can give this thing a shot and see how it compares. Dude, I think these models are going to have
Dude, I think these models are going to have their time in the dogfight, because it is getting really expensive, and people who know how to run models locally don't care; they don't want to pay the price, right? So they're just going to do it themselves. It's going to be interesting. Yeah. And you know, the interesting thing too, if you look at it, is
they're getting higher performance with smaller models, right? They're able to still get high performance with potentially smaller models as they go. And so what I'm very interested to see, and it kind of already exists with some of these models, but I think it still needs to get better, is a locally running
model that isn't too resource intensive. Yeah. That you can just run anywhere. And I
know you can with some of these models, but I do think they're not quite as good yet, right? And maybe they'll always trail a little bit behind, but I do think they're still one or two iterations from being what I feel like I get out of Claude Code today. But it is exciting that it feels like it's getting really close. Yeah.
But I think there's, maybe it's not a knowledge gap, but maybe it's just not that easy. In my opinion, it's super easy to use local models: use something like LM Studio and you're off to the races. But maybe it's because it's so much
easier to just put an API key in that people don't reach for the local models, or people have to navigate Hugging Face, which might be scary to some people, you know. But if you want to save money, that's what you're going to have to do. Yeah. For now, yeah, I think you're right.
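If you want to try the local route, here's a minimal sketch of what that looks like in TypeScript, assuming LM Studio's OpenAI-compatible local server is running on its default port (localhost:1234). The model id below is a placeholder for whatever model you've actually loaded:

```ts
// Minimal sketch: pointing the OpenAI Node SDK at LM Studio's local,
// OpenAI-compatible server instead of a hosted provider.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1", // LM Studio's default local endpoint
  apiKey: "lm-studio", // LM Studio ignores the key, but the SDK requires one
});

const response = await client.chat.completions.create({
  model: "qwen2.5-coder-7b-instruct", // placeholder: whatever model you loaded
  messages: [{ role: "user", content: "Explain open-weight models in one paragraph." }],
});

console.log(response.choices[0].message.content);
```

Once the server is up and a model is loaded, this behaves like any other chat-completions call, just with no per-token bill.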
It will be interesting to see what happens with price. It feels like there's going to be a lot of price pressure; there are just so many models, and when you have this many competitors there's a lot of downward price pressure. So we will see what happens. And I don't even know how profitable some of these model companies are right now. I think that's why Claude Code had to put in the limits, right? They obviously thought they
were losing money or not making enough margin. So it'll be interesting to see how all these open models that you can run yourself push on that, because if you want fast inference for some of these larger models, you still need high-powered GPUs, and that is still very expensive. Prices are coming down, but
it's still expensive to run. Yeah. Still hard to run on a MacBook, you know.
Yeah. And another related post. It feels like we feature one of Simon Willison's blog posts at least every other week. Yep. This one is from Simon Willison's blog, and he says that
something that has become undeniable this month is that the best available open-weight models now come from Chinese AI labs. It talks about Kimi K2, Qwen 3, and GLM-4.5 and 4.5 Air, which we just mentioned. And it all happened in about
20 days in July. From the 11th of July to the 31st of July, there were eight model releases according to this blog post, which is wild. And it says the only one that has an interesting license is Kimi K2, which is using a non-OSI-compliant modified MIT license.
They all offer their own APIs, and they're becoming more available from other providers. You know, we just showed that Cerebras is offering Qwen3 Coder models for pretty cheap. Yep. I think Fireworks AI also offers all these models too. So yeah. Yeah, it's becoming
pretty interesting. So you don't even have to run them yourself, and you can still take advantage of these models, and it's still likely, you'd have to verify this, but it's still likely cheaper, you know. I imagine it's still going to be cheaper. Yeah, it's just Fireworks charging you a bunch of money. I'm just kidding. Yeah, I mean, you're
paying a little bit of a premium so that you don't have to run it yourself, but it's fair. Likely not as big of a premium as using one of the big providers.
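One concrete note here: most of these hosted providers expose the same OpenAI-compatible API, so going from a local model to a hosted open-weight one is mostly a base-URL swap. A minimal sketch, assuming Fireworks AI's OpenAI-compatible endpoint; the model id below is a placeholder, so check the provider's catalog for the current one:

```ts
// Minimal sketch: the same chat-completions call, pointed at a hosted
// open-weight provider instead of a local server.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.fireworks.ai/inference/v1", // Fireworks' OpenAI-compatible endpoint
  apiKey: process.env.FIREWORKS_API_KEY, // your provider API key
});

const response = await client.chat.completions.create({
  model: "accounts/fireworks/models/qwen3-coder", // placeholder id; check the catalog
  messages: [{ role: "user", content: "What are the tradeoffs of open-weight models?" }],
});

console.log(response.choices[0].message.content);
```

Same code shape as the local version, which is kind of the point: the premium pays for someone else's GPUs, not for a different API.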
If open-source models become super popular, it's only going to force the big labs to release an open-source model themselves, right? Which we've heard so many rumblings about throughout the news, but we haven't seen it manifest yet. Last but not least, before we wrap up today: Gemini 2.5 Deep Think is now available in the Gemini app.
It's only available for Google AI Ultra subscribers. I think that's the $250-a-month subscription. I have it. I haven't tried Deep Think yet. I would be curious how
it compares to, you know, using o3 Pro or something like that. Yeah. But just another model improvement, right? They have 2.5 Flash, 2.5 Pro, and
now 2.5 Deep Think. I'm really impressed with Google's models, but I am so unimpressed with their integrations into their products. For example,
you get Gemini through Google Assistant, but that thing is as shitty as Siri, honestly. I'll say it. Yeah, it can answer stuff, but it can't actually do anything. You know, it's kind of
whack. It's probably not fully featured or anything like that, but yeah, I mean, it's hard to cross product boundaries, I think, when you're that big of a company. They build something cool in isolation, and then it's like, how do we actually integrate this with other products? It's like false advertising when they
say Google Assistant is powered by Gemini, and you're like, yeah dude, that sounds sick, let me try it out, and it doesn't work the same way, and you're like, oh my god, what is this? Anyway. Or even how bad of an experience it is to use Veo 3, which is awesome. It's really cool. I've made a lot of really dumb, but also
a lot of really fun videos with Veo 3. But you've got to sign up for this plan, and then you've got to go to Google Flow, which doesn't sound like Veo 3, but it's the product that contains Veo 3, and it's kind of janky. It works, but it doesn't work that well. Yeah, there needs to be some UX and DX improvements across this, because Gemini is powerful,
dude. If I was anybody else and I had my own model, I'd be like, "Holy, where'd these guys come from?" You know. Yeah. But then, you know, that puts them behind.
Yeah. Especially with their large context windows. Yeah. And they caught up pretty quick, right?
They're at least in the conversation, and for a while there they definitely were not. Yeah. Dude, Gemini is such an interesting thing, because on a very short timeline
Gemini impacted AI news with its huge context window, which then led us to say RAG is dead, right? There was this whole narrative going on. So hopefully they get better. So now this leads me to an interactive question. If you are listening, respond in the chat with what you think. Let's give a timeline.
Let's say end of the year. Who do you think is going to be best? It's going to be hard to judge, because everyone has a different benchmark, but we're not going to fact-check this. How would you rank these models at the end of the year? We have OpenAI. We have Anthropic.
We have xAI with Grok. We have Google. I'd say those are the four big ones, right? And then we have the open models. Which one's
going to be the top-performing model? We'll call it the top-performing general model. Where would you put your money today? And if you're listening, where would you be putting your money in the chat if you were to bet?
Where are you putting your money? I don't know. I haven't thought about it. I just thought of the question. I know it's not Claude. Like, I know it's
not Claude. You're not betting on Claude. Okay. OpenAI?
Yeah, I don't know. I honestly don't think the open model is going to be ahead. I think
it'll be surprisingly close, but I honestly think at the end of the year it's going to be Grok or Google. It's going to be xAI or it's going to be Google as the best model. That's my prediction. It's going to be one of those two. I think
OpenAI's got this hype around GPT-5. Maybe they surprise people. Maybe they're working on something. Yeah, it can't flop. Can't flop. It just can't flop.
But if I were to place my money, I think it's going to be on xAI or Google. Probably Google. Dude, I'm gonna say Google. I'm going to put Gemini. That's my bet. I kind of hope GPT-5 flops. Not that I'm
a hater. I just want to see what that would do to the world, because it'll change very quickly. Well, I mean, it can go one of two ways, right? And if you're listening to this, you can
tell me in the chat which way you lean. Either you think that AGI is imminent and nearly here, nearly here meaning within a year or two, and GPT-5 is going to be basically it, or a huge step in that direction, and you think the progress is accelerating. If you think the progress
is accelerating, that's one thing. Or you think that we're hitting some kind of natural plateau of what these LLMs can do, that maybe progress is slightly decelerating, or at least not increasing. Where are you on that spectrum? Do you think the progress is continuing to accelerate,
or do you think we're coming to some kind of natural plateau, at least for some period of time? I'm not saying we're never going to get to AGI, but it might take longer than what the accelerationists are thinking. So I think in the big model league, the big boy league, progress is becoming marginal, you know. The open model
league feels like the wild, wild west. You're comparing yourself to the big boys, but if you compare yourself within the open models, it feels good to be better than your peers, right? But the big models are just inching each other out. So yeah, I think marginal.
Yeah, I'm not saying we're going to hit a plateau, but I do think progress is slowing down. And maybe I will be wrong about GPT-5, and everyone can say I'm a hater and I did not predict the future accurately, but it does seem like it's slowing down a little bit. Maybe not a lot, but
they're having to come up with new benchmarks, because eventually the model just knows all the facts and everything, and it can predict the next best token with a slightly higher percent accuracy. I think every 1% increase is a huge
gain; you can unlock a lot more things. So the progress is incredible, but the big jumps, maybe not. And people will downplay it, but some of the marginal gains are actually huge when
you unlock different types of agents that you couldn't build before. But in general I'm kind of on the side that we're slowing down a little bit. Maybe there'll be something that surprises us, but I'm betting against it right now, at least in the next 18 months. Yeah. Maybe beyond that I can't
predict anything. But we have some people in the chat. B French says Gemini 3.0 Pro; he thinks that's going to be the best. Marcelo says Gemini.
Yeah. B French says Gemini edges out Grok on API architecture alone. Okay. Interesting. Justin says his rankings would be OpenAI first. Okay. That's interesting. You're the
only one so far, Justin, that thinks OpenAI is going to be on top, then Gemini, then Anthropic, then some kind of open model basically tied with xAI. And it is hard to say how "best" is defined. Are we talking about coding? Are we talking about general tasks? What benchmarks? Who
knows? I'm just saying as a feel. Should we do March Madness for models?
Dude, we'll sit on that idea for a bit. Yeah. How do you make a tournament out of models? Yeah. I don't
know, but we can talk to Recall about that. Yeah, we certainly can. All right, dude. This was one hell of a show. Yeah, we went two hours and 15 minutes. For
those of you that joined us, whether you're watching this live right now, watching on Spotify or Apple Podcasts, or listening on your favorite podcast app, thank you for tuning in to AI Agents Hour, or AI Agents Two Hours and 15 Minutes this week. We talked about the Mastra hackathon. Please go to mastra.build if you want to
build a Mastra template. Come to our Discord and talk about it and we will help you out. We talked with Vijay from Oso. We talked with Jeff from Confident AI. We talked with Annie from Smithery. And we did a whole bunch of AI
news. So it was a great show. Yeah. Thanks everyone for being here.
You guys are awesome. See you. See you. Peace.