Back to all episodes

Does MCP Suck? TSAI Conf Recap, Ismail from Superagent

November 11, 2025

Today we talk about MCP and about all the recent online conversations around how much it sucks. We discuss the TypeScriptAI Conf, have Ismail from Superagent join us, and do all the normal AI news.

Guests in this episode

Ismail Pelaseyed

Ismail Pelaseyed

Superagent

Episode Transcript

2:42

Hello everyone and welcome to AI Agents Hour. I'm Shane. I'm with Obby.

4:15

What up? We're here on Monday like we usually are. And like we usually are, we started a few minutes late, but you know, we run a tight ship around here, so sometimes, you know, it slips.

4:28

But we have a great show for you today. We're going to be talking a little bit about MCP and trying to answer the question, does MCP suck or is there just a lot of drama going on? We'll talk quite a bit about the TypeScript AI conference which was last week. It was sick. We're going to go into detail on

4:46

some of the things we launched, some of the things we learned. We're going to be talking AI news and then we have Ismail from Super Agent coming on to talk about some cool stuff, some cool findings and benchmarks that they've been running over there. So, with that, how's it going, dude? Dude, it's good. If anyone's watching,

5:05

you know, new in a new location now. Um, I'm not going to dox ourselves like we did last week, which we'll get into. Um, but, uh, yeah, I'm doing good. You can see we have some new things going on here.

5:20

New background. New background. New book. New book, which I don't have a copy of,

5:26

which is hilarious. Okay, that's okay. You know, there's a new book out there now. Now it's

5:31

two books. Two books. Cuz two is better than one. For the price of one, you get two.

5:37

For the price of zero, you can get two. For the price of zero. Just gota you just got to know a person, you know. Yeah. Maybe someone on the show can hook you up if you if you're interested.

5:49

Yeah. I I think we a little poll. This is a live show. you know, you might be you might be listening to this later, but if you are watching this live, either on YouTube, LinkedIn, X, whatever, uh tell

6:04

us if you did watch last week's show, was that the best location for a show or not? Should we do all of our shows from live from bars? Because if you missed it last week, we uh I rolled in off a plane into San Francisco and we just met at a bar and we did a whole live stream from a bar.

6:23

That was my favorite live stream we did, dude. Had had a beer, you know, had had guests. We had a surprise guest encounter. So, someone was watching the

6:36

show. We told everyone what bar we were at. And about 15 minutes after the show ends, someone showed up and had a drink with us. So, that was pretty cool. Super cool. Super random and super cool.

6:48

And then we met this person at the conference as well. So, that was really cool. Cool. Yeah. I'm gonna try to If you're

6:53

listening, send me that picture. I know you took a picture. It'd be cool to, you know, tweet that out.

7:00

Yeah, that bar, dude. Shout out to Victory Hall. That came into handy so many times last week, gathering that many people to work and then also, you know, be fed and have everything. And we

7:13

didn't plan it at all because that's how we basically we basically used the back room of that bar as a co-working space and we just paid for food and you know drinks or whatever while we were there. So thankfully the team eats and drinks a lot so they got there they were happy you know. Um, I would love to. I mean, when we do our next show together, we'll be in this room and we're definitely going to make some

7:39

modifications and stuff, but uh, yeah, it's cool to have like a place where we can record the show when we're all together. Yeah, we have a gonna have a studio room. We're getting to we're going to get to be slightly more official and then unfortunately, we're still going to show up 15 minutes late.

7:56

I'm not going to say whose fault it was today. If you're listening in the chat, who do you think was the cause of the lateness today? I'm just curious. I don't know. You tell me.

8:08

Well, there was uh as there always is on X in this AI world, there's some drama and we always like to lead with that, right? Because that gets the that gets the clicks, that gets people in here. So, let's going everyone loves to hear about the ongoing drama. So, we'll spend a few minutes,

8:26

but Obby, does MCP suck now? because I'm hearing a lot of things that are making me think that MCP actually really sucks. Dude, I mean, was was MCP ever not sucks or did did it ever not suck? I don't

8:40

know. I don't know. That's why that's why you're here. You're here to you're here to tell me. Yeah. So, the Okay, there's a bunch of

8:46

origins of this stuff. So, let me like start from the beginning and then we'll get to where we are and then move forward. So when we started it like over a year ago, MCP didn't exist. And so

8:58

what you were doing was making like lang language specific uh tools. So in JavaScript, we didn't really care that much like our our our squad here because we already have node SDKs for everything, you know. So like tool calls and stuff wasn't a big deal. But not every community is like the JavaScript

9:17

community. Some people are in the Python world where, you know, it's just a barren wasteland of But anyway, so is MPM as well. I'm just joking. So then MCP came out. So the good thing is like

9:29

to have the bad thing about integrating tools through your language specific stuff is there's no like standard protocol between the different modules you use, right? So you're going to be using Stripe client SDK. Then you might be using some other fool's SDK that is not the same. And then you never know

9:47

the quality in the docs of all these different things. So when MCP came out primarily at the first for tool calling which is probably the only use case it really has people were stoked because now there is not really language specific tools it's like agent specific tools written in any language and you

10:07

know each the tools have schemas and can implement their own thing and that's when it took off because people were trying to solve the tool problem in their own frameworks or in their own projects. Then there was so much added to MCP. You got resources, prompts, elicitation. Um there's like progress now and then

10:30

you know things like that. But the biggest question throughout this whole time was like how does MCP do off because my tools need authentication because these are third parties etc. And then there's a whole work going on into doing that. And so that's where we are

10:48

today. ish where and then now I think people are trying to leverage like well the spec's taking too long how do you do it it's hard to do you know like what do I do and if there's no answer for a while people already know how to do the things in their language specific tools and I think the the tweets that we'll show you are people who are deep in their own language specific tools

11:13

building language specific tools right so let's go with the first tweet So before we do a couple items from chat. So Val says you heard there's drama and that's why you're tuned in. So thanks for thanks for being here. And Editia says can we do studio or stddio transport

11:34

uh in production rather than HTTP. So rather than like streamable HTTP can you do stdio? You could do it in production. Sure. It's gonna suck, but you have to keep the

11:45

you won't be able to run like serverless or anything, but you could do it that way. Yeah, it kind of depends like depends on what your definition of production is. If your definition of production is you want it to be used in cloud desktop or something, then sure it can work. You do run into some issues though like so for

12:04

reference the MC the MRA MCP doc server just uses stdio, right? We download a package locally. There are some issues though with like Windows for a while. I think it's better now. We don't hear as

12:17

many reports, but because you're just downloading a package locally, they need node installed, you know, for every there's just some weird things that you get into where if you're using HTTP, you don't have that problem. But the kind of wild thing is you have access to the whole file system. So you can uh maybe

12:36

be concerned when you download some don't download sketchy MCP servers over STDIO but um monsters one that's safe and yeah so it can work. I think it just depends on your use case though. Yeah. All right let's let's get into the

12:53

tweets. All right. So, I've I'm such a journalist, you know, but uh let me just do my timeline. Look at it real quick.

13:05

Okay, so the first shot was always by, you know, it's always by Theo and that's awesome because this is hilarious. But let's do this one. So, I'm starting this timeline in November 7th, so like last week. So

13:22

MCP is a great example of why we shouldn't let Python devs design APIs. Now I totally agree with this statement. Um but I don't want to like disrespect anyone. Uh but anyway, I do agree with the statement and that's hella views and

13:40

that was last week. So that's cool. I like the first comment. Who is we? And then Theo people who want things that

13:46

work. So good, dude. That's just so good. Um,

13:53

yeah, that was a good comeback. Such a good comeback. Well done. Well done, dude. Then the next one I

14:00

tracked down is from our homie at Browser Base. Let me get this. So, he posted this yesterday, but I think he, you know, posted this a long time ago. Um,

14:19

he's been he's been on this vibe for a while and this is awesome. So, this is really the earliest thing, but contrarian take. MCP won't exist by the end of the year. I can't explain it, but the vibe is wrong.

14:31

Interesting. I think I think he missed that one. It's still obviously going to exist at the end of this year. So, that was a miss.

14:37

That's a miss, but it was a bold bold take. And then I figured out MCB gives me GraphQL vibes. All this abstraction for what? And I think this is in response to all the other features that were added

14:49

after tool calls, right? Yeah. Um that's my theory because I don't use anything. I don't use MCP other than tool calls. And you know, honestly, if people

15:00

packaged their node module or their SDKs as just tool importable tools, I probably wouldn't even use MCP. That's my hot take. But let me continue.

15:11

Then we got the homie DAX here. Let's do this. And this one's a longer post, but we can read it. Uh, here's my gripes about MCP as an implementer. End users expect

15:23

miracles from MCP and load in a bunch of them. True. Most MCP servers are garbage, crash a lot, and the LLM never bothers calling them. Semi true, but definitely have seen this happen, right?

15:35

Definitely the garbage part. Uh, this whole try SSE and if it fails, try streaming is exactly why I'm so anti-standardization when things are already early and already a ton of bloat. Yep, people run into that. Even

15:49

if an MCP is good, 99% of people won't know how to configure it. It's now on me to create some complex UI that helps people manage all their MCP tools. We've done that. Yeah, companies felt pushed to offer MCP and

16:02

have a half half asked implementations that users expect to work well. This is a number one thing we see. Yeah, it basically what it seems like everyone just takes their open API spec and says convert this to an MCP and they have 400 tools or you know a hundred tools and they just then people just try to call that right and it's just too much.

16:26

Yeah, it really feels like the kind of product dev I hate. Come up with some idealistic vision of interconnected seamless future create a spec out of it. Everyone gets hype and then all the real world product and business obstacles show up. That's a very hot take. Then our homie David got gets in here.

16:45

Can't shy away from the MCB discussion, of course. Uh so David's like, "Users want want something. Why do I have to cater to them?" So it's like he just says that there's just a lot of complaining. Um and so I think David is

16:58

very prom. He definitely demoed a bunch of MCP stuff and things like that. Uh Paul's in here now, too. I think it's challenging for protocols like MCP to have durability this early into AI.

17:12

Boom. And then, you know, this is just a great thread. So, you guys should go read more of it. But this is where a lot

17:17

of Zingers are in for sure. And then just to uh pump the homie the agent boy himself can't can't go a month without me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me mentioning Sherwood. So bring him in. Got to get the agent boy in here. And so

17:37

this is very like you know um neutral kind of voice here. Uh few things few things people forget about MCP. One it was released three months before cloud code. So the idea of agents that can

17:50

basically do anything via bash wasn't popular or well understood. Very true. Two, it was for a while the only way to add thirdparty tools to desktop AI clients. That's the big freaking point.

18:03

Um, and this was a killer feature. It's still useful for creating tools that just work across many agent clients, especially when those tools need to be run in remote M. Very true. And while I agree the MCP has a lot of flaws and the

18:14

hype from earlier this year was out of control, I don't think MCP is useless, nor do I expect it to go away. Nice little ending to this. Getting good traction here, too. So, good job, Sherwood. Um, yeah. What are your

18:27

thoughts, dude? Well done. Uh, yeah, I agree with Sherwood's take, you know, partially. I think he's he's I the one I agree with most on this. I think that it

18:41

a lot of this stuff isn't new. Before MCP when you wanted to connect there was all these different like tool providers, right? I mean I can think of a half a dozen where literally like integrations platforms. I mean you know candidly for anyone watching this at one point but before Monster we

19:00

were talking about just trying to solve the integrations problem because that was also a problem we saw but there's tons of you know tools out there. there's tons of services that are just how do you connect a whole bunch of APIs really quickly and that's kind of what MCP was trying to solve. The promise of it of course was that we can have

19:19

standardization. Maybe we can kind of cut out these thirdparty tool providers, right? We don't need those. We can just like it it can be open. You can just grab the tools you want, give it to the agent and go

19:31

straight to the source, right? just connect straight to the different uh Gmail APIs or calendar APIs or wherever what other tools you want to add. I think with challenges like O that becomes a little harder and so there's still like this middle layer that seems to exist. So I don't think it fully solves the problem and I still think O

19:50

is way more complicated than it should be. Oath in general like is the right approach but it is just adds a layer of complexity when you're authenticating as an agent. you know, agents authenticating as itself or you're giving, you know, credentials for the agent to use. There's just a lot of complexity there and obviously people

20:08

are scared about security implications and all that stuff. Yeah. I mean, with or without MCP, all those like considerations are still there, you know, because it's still like an agent calling a tool and Exactly.

20:19

you know, there's just no standard way for you to do it. But then I guess we go back to this world where just figure it out yourself, fool. You know, like that's what you do with off at your own on your own product, you know? Uh

20:32

maybe that's the way. I don't know. That is the way it is right now. It seems it is the way it is.

20:38

And it's the way it always will be. Yeah. O is always, you know, it always comes down to off challenges regardless of how you look at it. And

20:49

I've seen all these different people try to sol solve agent off in different ways and it's just hard to standardize anything. So I don't think there's really anything that is this is the standard way to do it. There's some patterns that are starting to become a little bit uh more commonplace but yeah it's challenging and yeah a off is a

21:11

means but but yes I agree. Yeah. Uh, oh, you know, Ally sent us a tweet.

21:18

Um, here, let me post Alli's find it. Well, if you're listening, what do you think? Is M does MCP suck? Do you use it? Put in the chat.

21:30

Yeah, put in the chat. Are you actually using it in production on something? Uh, so this is from our homie Ally. I do think sandboxes have the potential to make some usea MCP use cases obsolete. I

21:43

can't quote two tweets at once, but I lied to Mason Williams post from yesterday saying the primary benefit of the MCP protocol is to give agents a way to interact with thirdparty tools. Once again, tools in a way where the user can customize that agent to their liking. Totally. With some sandbox providers

22:01

I've tried, you can programmatically decide what tools this the agent has access to. So, it feels like you could accomplish the same with a little bit more effort than connecting an MCP server. Totally. You could code your own

22:12

tools and add it to the sandbox or use the ones that come with the sandbox. For example, if my agent was designed to review code and create an architecture diagram, I'd use Daytona's little sponsor there. Um, built-in sandbox git clone tool and then use cloud agent SDK inside the sandbox to summarize the codebase. I could use the GitHub MCP

22:32

server for that instead, but then I'm just introducing rug pools and other MCP vulnerabilities necess unnecessarily. This is a really good take, dude, because um this dude right here, Caric, and I agree with him too like sandboxing just having compute that exists and you can just generate code and run it there or even run whatever you want like you just want to do whatever you want and

22:57

agent has access to create one of those and execute that plan. I think it's very interesting and maybe that could be the end of MCP or some other way of probably the end of MCP and that's what code mode is, right? Like that's another kind of sandbox execution thing.

23:14

So Alli's coming on next week and this is something we should bring up and talk to her about because there's also I can see other security vulnerabilities about just letting an agent write code and execute code. So you know in some level you could be introducing additional complexity if you just allow the agent to write its own

23:32

code whether it's Python or whatever and then actually run that code. Now, of course, it's sandbox. It can only do, you know, damage, so much damage inside that system. But that sandbox can probably hit the outside world, right?

23:44

It can hit a URL. And so, it is just kind of interesting that it can I I guess I have red flags coming up on both sides like yes, MCP has a bunch of vulnerabilities, but I could also see that just giving someone a sandbox to run any co giving an agent sandbox to run any code they want could also lead

24:03

to some vulnerabilities as well. Yeah. The chat is chatty right now. So, let's

24:09

uh take some comments. Uh what is the O choice for Monra? Personally, I use neon o aka stack o works well with neon db. Uh for our

24:23

projects, I mean so we have o adapters. So whatever you want to use like personally um but you know superbase is kind of what we're using because it cames comes with our DB which might be a mistake in the future. Um let's see a said this is the production question that just came. Yeah. Well yeah that that's more of an MCP problem.

24:50

Yeah. So sometimes it works sometimes it doesn't. I think I think a lot of MCP clients are not necessarily even production ready yet. There's not a you know like longunning connections seem to fail in a lot of

25:03

cases both MCP servers and clients. Yeah. Okay.

25:10

All right. Le sprite. No, I feel most people will make their own agents tools.

25:16

It makes sense for some like Stripe, but majority of people will want custom tools. Agreed. Agreed.

25:22

Yeah. I mean, I think a lot of times you you want to customize maybe a collection of tools into whatever kind of workflow you actually want to accomplish a task. So rather than giving the agent just raw access to 50 tools, you can say, "Well, the agent actually just needs to do these 10 different things that maybe interact with a couple different tools."

25:44

Yeah. Like for most cases, right, you just hit third parties when you need them, but you're still writing business logic in your own application. It's the same kind of thing. Yep. Okay, keep going. There's a bunch. You guys are chatty today. Uh, usually use

25:58

MCPS for getting docs into my coding agent. Otherwise, it goes a bit wild and kind of scary to install them on the work machine. Um, and then Bolt is using web containers for that. Great tech for sandboxes. Okay. And the for the MCP docs one. Yeah, totally. We did the doc

26:16

server for that kind of reason. Thanks for all the comments. Yeah. So, there's the drama, you know,

26:23

for the week. We always gota we always got to lead with the drama. So, you know, you tell you tell us does MCP suck or not. I think there's still a lot of hype around it. I don't think

26:36

it's going away, but I think, you know, it has faded. The the the appeal has faded a little bit. Let's do a Okay, let's do a bet. Like I don't care which side. Do you do you

26:48

have a pref preferred side? Well, I mean, what's the bet? You know, you buy me a drink, I buy you a drink, whoever. I mean, but how do we validate if it's if it came true or not, right? Because

26:59

the MCP is not going to be gone in a year. It'll still be here. It's like, how do you how do you determine a level of adoption where we could say, you know, who's got the poly market on this? If maybe we should make a dude, that's a great idea.

27:16

got the poly market. Let's let's let's get a bet going. All right. Figure that out in the mean

27:21

for between this show and the next how we could bet on stuff. Um but what if it's like a major company does not maintain their MCP server anymore. Okay. Yeah. I don't care which which side. So

27:34

you can pick which oneever you want. But who I do think a lot of the providers, you know, like Sentry is going to still support their MCP. I imagine one bigname player stops.

27:48

What if Claude even stops? It's like, "Yeah, we're just gonna go with What What if Claude hands it over to the They hand over the speck." Yeah, the spec goes to the community. That's the bet.

28:02

Yeah, I don't think it will. Okay. I I think it will. So then I'm on that. So in a year in a year from now, put it

28:08

on the board. You think that you you're saying anthropic will hand it over to some kind of like governing foundation or you know they'll just let it to pass. Yeah. Okay.

28:18

Okay. Yeah. I don't think they will. I think they'll want control of it. But yeah, that's a good bet. I think I will I will

28:24

that drink's going to be tasty. Yeah. When you buy a hater, you love that bet.

28:30

Yeah, absolutely. I guess if you're on both sides, you love either one. So, all right. So, let's keep the show going. We did have something come up in

28:40

the chat from earlier. Haven't forgot about you. Says, "Could you please upload the talks from the live streams in separate videos?" So, I'm assuming you're talking about not the conference,

28:52

which we're going to talk about next. I think you're are you talking about the different segments? If you could clarify, it would be pretty cool if we could upload each segment as a separate video, but the honest truth is we're not going to do it because I would have to do that and I'm not going to do it. But one could be some sweet use of AI to

29:14

like clip when the the bottom bar changes, clip the videos into different parts. Or honestly, if we probably should use like some AI to like time stamp the different spots so you could then just click through pretty easily. The AIE Ebot uh put time stamps in a comment, I believe.

29:32

Yeah, we need to figure that's probably open source or something, right? We got to figure out how to We should have our friends at Mosaic just like do it for us. That I'm supposed to I'm supposed to chat with them about some of this stuff.

29:44

I just haven't yet. But that's a good idea. We uh we could get better about descriptions, timestamps, you know, we're just kind of lazy. Yeah. So,

29:57

all right, let's talk about Typescript AI. Dude, that was that was obviously it's it's normally it sucks to throw the party and there were times leading up to it, it kind of sucked to throw the party, but when you're in the party and it's your party, it was a lot of fun. It was one of the best conferences I've

30:16

been to in a long time. You know, obviously I'm biased, but it was really cool. So, maybe we can set the stage a bit.

30:24

Typescript AI comp was November 6, last Thursday, and the lineup was incredible. We had a ton of great speakers. the we a lot of the master team was there so it was pretty great just to get it getting in you know working at the bar with a lot of them as well and shipping a lot of things right before the conference we'll talk a little bit about that and really just learning and having

30:50

conversations I think was probably the the best part just what people are building a lot of people using master a lot of people not right but just learning what the community that considers itself like Typescript people are building and the tagline of course you know you wouldn't be surprised if you watch this show the tagline of the

31:08

conference was Python trains TypeScript ships and so it was kind of a just around the idea that a lot of people are just building things in Typescript and you should be able to ship AI features in Typescript doesn't have to be Python what did you think well it was a very stressful week from engineering side um but uh once the conf

31:32

conference started and the objective was hit. It felt so good to be It felt so good to be done or in there and then everything also being high quality, everyone having a great time. Um, it was kind of like a it was a fairy tale conference, you know. Yeah, it was awesome. Did you finally catch up on some sleep?

31:55

Yeah, dude. I think I think I slept for like three days, it feels like. Yeah. leading up to the conference, not a lot, but after I I had to I had to recover a bit.

32:07

And we we had our whole team, most of our team, not the whole team. There was definitely Ward couldn't be here, which made me sad all the all week. That was tough to get through everything without Ward was very tough.

32:17

Yeah. No Ward, no Tyler. I mean, a whole bunch of people, right? A lot of people that couldn't be there.

32:24

And uh so that was tough. Um but let me show y'all. Oh, I was gonna share something as well.

32:30

You go first. I'll go first real quick. This is just We had all the the whole team. This not everyone in the picture, but this is a lot of people minus one.

32:41

Um, and this is, you know, we're wearing these green shirts. Uh, it was cool. Yeah. Here, let me share. So, this is

32:49

I'm probably I don't know if I'm supposed to share this, but you know, someone on the marketing team will will, you know, come find me. Come yell at me. I don't know. But we'll probably be sharing some of the stuff out. This is just some some

33:02

pictures from the conference. So, a whole bunch of, you know, really good pictures. You can see this is, you know, all all on stage. Hey, he's the

33:13

one we met. That's the homie. All right. And then you can see obviously the team picture which Obby

33:19

already showed which a few people that were already there. Honestly, there's three people not in this picture and they weren't in this picture because they had to go in the afternoon, go back to the hotel and sleep because they were up so late the night before just getting things ready. Yeah. So, it was a huge team effort and yeah,

33:37

it was great. I I we started getting in some feedback, too. And the feedback's really good. A

33:44

lot of people have post about it on LinkedIn and X. We got, you know, we sent out a survey afterwards that was very positive. So, I only saw the first like dozen or so because it just sent out not long before this and everyone was like four or five out of out of five. So,

34:02

that's that's speaks volumes. I think one quote was it was no no shills you no sales shills. It was just no shills information knowledge sharing. And so, they they really liked that.

34:16

Uh what any takeaways? Anything that was interesting to you? Do we have time? I figure we could recap

34:22

just uh Yeah, we we'll probably end up doing the AI news after. Okay. But we we have some launches we can share as well that we can talk about that we launched on stage.

34:33

Um let's start with the the merch. All right. So, let's start with the merch. Okay. We got this new hat. Now, we've

34:39

been talking about hats. So, we got monster logo and hat. Now, um I'm going to go get the other merch. Will you talk about the logo? So, all right. So, we'll share that uh

34:51

that launch first, I guess. So, I was not prepared. All right. Here,

34:56

here we go. All right. So, we had three different things that we kind of launched. Uh Sam tweeted out. So, we had new logo, new

35:07

master logo. We have a new design system in kind of a visual language iconography around this. So just kind of a new design system for all the MRA branded stuff. We

35:20

have a new website. If you haven't checked out our new website, go to master.ai, check that out. And it kind

35:27

of uses all these this new branding. That was the first that was one of the many things we launched. Yeah. Um, then just getting back to merch real quick. We have team merch. We never a

35:40

lot of people were wondering like internally. So, y'all can't get this stuff, but if you did, maybe in the future. Uh, we got a bunch of internal shirts for everyone. People love that stuff. This is like a matcha shirt here.

35:54

We had some from the conference. This is the one we wore at the conference. Like super yellow or not yellow, green. And it was actually really it was so green, but it was so good because then you know who's you

36:08

know working at MRA and you can have a conversation with them. And it says Python transcript. Oops. And then we had like an afterparty

36:19

outfit. Same thing. And I don't know. This is like if you're like a a Gen Z person. This like a Gen Z

36:27

shirt right here. So yeah. Yeah. I I didn't qualify, but I I did wear it. I didn't know what I was

36:32

wearing, but yeah, t tons of really cool swag. Um, we did announce as well that we're going to be having days April something 6th or 9th or something. 9th.

36:51

Yeah, in SF. And we're probably going to be doing, you know, you heard it here first, more than just an SF eventually. So the goal of those are not to be a big conference like the TypeScript AI is for the whole TypeScript community. MRA days

37:06

is for people that are building with MRA. I want to talk about how they're using it, the things that they're shipping. So we're going to have a lot of users come on and you know talk about the projects they're building and you know how they figured things out, how they got things into production, the things that they struggled with. They should be like demos and lessons learned

37:24

and really community kind of community organized in a lot of ways. like we'll organize it, but it's going to be communitydriven at least a lot of talks from community members and we'll have a few other things that, you know, we'll probably bring in a one or two big, you know, big big name people. I would say we'll probably try to get, but we're going to keep it more community focused.

37:42

So, that's coming. That's kind of the next big thing. And I'm hoping that we'll we'll try to have some Monster swag at that at those events. So, if you want some of this limited edition stuff, we're gonna try to open it up. So, it's

37:54

not just the Monster team, but anyone who wants it. Yep. We're we're in the swag business now.

38:00

Yeah. We you know, first it was first we're you know book printing, book publishing. Now it's you know Yeah. We're a conference book conference. Yeah. I mean we all all the

38:12

things. All right. So speaking of uh book printing, we launched a new book as we showed. So you've probably heard of Principles of Building AI Agents written by our

38:25

co-founder Sam. He wrote another book called Patterns for Building AI agents. The biggest thing is we talked to tons of users and it's always a struggle going from the prototype works to getting into production. And so that's

38:39

really what this is about. It's about just patterns that we've seen that have been working well, you know, and kind of just like it's it's not going to solve all the problems, but it's going to give you like useful just useful tips and useful patterns for how you might think about getting something getting an agent from the prototyping stage all the way

38:57

into production. So, that was a big one, big launch. Uh, no particular order because I'm just clicking around now. We didn't announce our V1 beta, so it's not

39:09

GA yet, but tell us a little bit about the V1, Obby. Yeah, so over the last year, we've been just hustling on trying to understand what Most should look like. Um, and that's, you know, from agents to workflows to MCP tools, eval, etc. And

39:31

in doing so, we tried to move the pro the project forward, but also do our best not to break people along the way, which we did. Okay. I mean, we definitely had some blunders along the way. Um, but now we're trying to move forward with the learnings that we have

39:52

and the V1 is like the perfect way to pretty much throw away all the stuff that we didn't learn anything from and it didn't pan out and things that we did do clean them up and listen to more users and that's what the V1 represents which is like okay this is MRA what we think MRA is going forward. Um yeah cuz when we started we were in you

40:16

know our YC apartment thinking of stuff things changing every day things are still changing every day but even a year ago the different primitives people were needing and stuff was changing every day. So V1 is just like this is what's going forward and we have a sick ass picture for that on the home homepage. Maybe I can show that real quick. um or y'all can go look at it. But

40:43

it's kind of like it's kind of like this this agent icon that we have now kind of represents what MRA like is. It has all the primitives and you know this is kind of what we're building and it's cool to know that now you know and we didn't steal that icon from Singular Wireless. It's not the same thing. not the same. Even though Even though Obby tried to convince me for a long time that it was

41:15

I've seen that before. No, it's different. But yeah, there was one more thing that we launched. So, I'll share

41:26

that as well. We launched a UI dojo. As Sam mentions here in this post, we kept getting asked how do you do this with AI SDKs, you know, elements, how do you do this with assistant UI? How do you do this with C-Pilot kit? And so we wanted to build examples with each and

41:47

the coolest thing is we had people from all those teams at the conference. So we had Nico from Verscell, we had Simon from Assistant UI, we had a Tai from Copilot Kit who you know Tai was on the show last week and it was really cool to kind of have one place where all we can showcase all those things and of course now all you know all those teams want to

42:09

make our UI dojo even better specifically for their you know because they want it to show really well and show all the cool uh capabilities and how it might be differentiated. So that's cool. We're going to keep iterating on it. we're going to make it

42:20

better, but it's meant to be a really good example for how you could use Maestra Mastra alongside one of these tools to really build a new front end. And so you can use it to kind of learn from or to kind of drop in some pieces in your actual app. It was sick, too.

42:38

Yeah. And it was one of those things we, you know, we didn't know if it was going to be ready. We weren't really planning on it, but you know, it's it came together and we got it done just in time. Yeah.

42:51

All right. So, we got some questions and then Yeah. Uh hopefully our guest will be joining here soon. Can you throw up a couple of those chat messages while I track down if if our

43:02

guests here if we should keep going into the news? Val still need my book autographed. Well, if you ever come to that stuff, we can do that. Um

43:14

how to get the physical or digital copy of the second book. Uh, go ahead. Yeah, I got a cheat code. I don't know if this is going to work, but someone could try it and I'm pretty sure it's going to work. So, if you haven't

43:26

registered for the conference yet, we have not sent out the there's like a follow-up email that's going to go out that's going to have a digital copy of the book. Go to the uh the tsmp.ai, click register. It's probably the button's probably still

43:45

there. I don't know. I doubt we've turned it off. I wonder if it'll still let you register after the fact. It's free for the the virtual attendee and

43:52

then you'll get the email that has the digital copy. We are going to put up just like we have for the first book. If you go to master.ai/book, we'll have a landing page. If you haven't got the first copy yet, if you

44:02

get that, I'm pretty sure we're going to just send out an update to that whole list at some point saying here's the second book if you want it. Here's the digital copy. And we will be, you know, sending out some physical copies. I don't know yet how we're going to do that though. So, I can't say on the physical version, but it's probably

44:21

going to exist. I just don't know when we're going to start doing that. We still are sending out a lot of the a lot of versions of or a lot of the original version. So,

44:33

and Hashim, you know, Hashem said it looks like Mastra Mastplat. Yeah. All right. That That's how people talk about it.

44:45

There it is. The splat. It's It's the splat. So, that was the

44:51

TypeScript AI conference. Hopefully, you know, we will see a lot of you there. If you weren't able to attend this year, we're going to do it again. We're trying to build a community around people that just want to build with Typescript and

45:02

AI. So of course we did talk a little bit about Maestra but it was mostly community focused and everyone from you know Swix to David Kramer from Michael Grenwich from work OS from you know all the demos they were just kind of talking about interesting things to them which were not MRA specific but they also just

45:22

weren't shilling their own products which is great. It was really felt like a good knowledge sharing session. We had panels with, you know, competitors on the panel. And so it was really cool to have, you know, two observability

45:34

providers, leaders of observability companies on the panel together and just having, you know, friendly conversations around how they think about things and how they differ, but also it's it was just a really cool environment. So feels like we're building something special and I'm glad for all of you that were able to be part of it. And for those that were not able to be, we will be

45:56

releasing the videos individually over the next couple weeks. So, we'll probably try to launch one or two a day. Probably starting we'll get the first one. We got to do some like processing

46:08

editing on our side, but it'll be soon. This week, we'll start launching some of them. So, uh keep an eye out for that.

46:13

Go to our YouTube if you're not already. Subscribe. And if you're listening to this later, if you're not listening to this live, give us a fivestar review. If if you are so inclined, if you don't

46:24

want to give us a five star review, find something else to do, please find Jesus. Five stars only. All right. Well, should

46:31

we bring on the first guest? Yeah, let's do it. I guess the only guest today, the one and the only.

46:39

All right. So, we're gonna bring on Ismail from Super Agent. You might remember Isma was on before. Had some

46:47

really great hot take moments. So, I I'm excited. I think we're gonna we're gonna have a good conversation. What's up? Hey guys, how are you doing?

46:58

Good. How are you? Good to be back. Thank you. Thank you. Yeah. Welcome back. How are things

47:03

going? It's going well. It's going well. We uh since last time I was on, I actually got

47:10

a bunch of feedback from some of the people in the live stream, which was amazing. So, uh, and we've done a lot of progress and today I'm here to to share some of that progress with you guys. Um, awesome. Can you can you give us a quick give us the real quick 30 second or less overview for those who didn't watch the

47:31

last show? Who are you? What are you working on? And then let's go into it. Sure. So, I'm Ismile. I'm the CTO of

47:39

Super Agent. Super Agent is a YC winter 24 company and uh we help uh agent builders build safe AI products basically. And the way we do that is that we help them uh measure and secure their AI agents uh for stuff like prompt injections, hijackings, you know, uh leaking of private data, leaking of secrets, that kind of thing. So we have a set of uh customtrained uh

48:15

guard rails in different sizes that you can use depending on the use case you have. So we have some that are super fast for like voice agents, some that are a bit slower and more powerful for other type of agents like coding agents. And then uh we also just released a tool where we built an agent that can actually attack your agent and see how

48:40

uh you know how it performs and how safe it is. So that's what we do in a nutshell and we help uh you know every every type of company from like you know financial companies to insurance companies to startups anyone who's building apps and wants to keep those AI apps or agents secure basically nice what you got for us today

49:06

so uh as I told you like we how when we started training our own guardrail models One of the uh you know u one of the challenges was that we needed to uh be able to generate real type real life attacks that has happened in the past or is happening today so that we could take that and and synthesize basically a data set that we could train our models on. Right? Because these attacks are pretty

49:38

sophisticated. they are usually multi-step uh and there are new attacks popping up basically every day. So we wanted to create a data set where we could you know catch most of those type of attacks and and uh leaks and that kind of stuff that's going on with these uh you know uh uh state-of-the-art models. Um and so

50:03

that was a big challenge. We managed to do that. We outsourced a lot of it to actual like red team people and hackers that could help us, you know, set that uh um data together and uh we also uh use that data to train these guardrails of course. So the interesting part of uh

50:28

that work was that you can actually use that same data to create an attack agent. So, an agent that's good at defending should also, you know, in theory be good at attacking. Yeah. So, uh that's what we built these, uh

50:45

past couple of weeks since we since I was last on. And um we went through and attacked the 20 I think it was 25 stateofthe-art models and we um released a benchmark that uh shows how these models perform and I'm going to share that benchmark with you guys today. We just released it. We call it lamp bench.

51:15

Uh so it's a sacrificial lamb basically all of these models for us and there are some really interesting takes. So let me just go through you know how this uh benchmark actually works. So if we take like GPT5 we test uh three different categories. We test for prompt resistance which is basically you know

51:38

jailbreaks back doors and that kind of uh attacks. We test for uh data protection which is like how often does the model actually leak data that it shouldn't. It could be secrets, it could be you know social security numbers, any kind of private information that's that the model is trained on not to leak

52:02

basically. And then we also u uh test for factual accuracy which is basically how how much does the model hallucinate giving a given a set of uh data that we pass to it. So an example could be that we basically simulate an agent with one of these models. We give it, you know, access to a environment file and we try

52:29

to see what what what we can do to have it leak those environment variables. That's a good example of how we we test uh prompt resistance and data protection. When it comes to factual accuracy, we do a similar thing. We could, as an example, we could paste a

52:47

Wikipedia article or we can upload a Wikipedia article to the environment and we could ask it to recite, you know, specific information from that Wikipedia article. And then we have an evaluation agent that goes in and actually checks if if that's f the answer is factual or not. And so we have around 50 test cases for each of these categories.

53:14

And uh these are the uh results that we currently have. And one of the interesting things I think with uh this benchmark in itself is is of course the actual scores. All of them all of the state-of-the-art models score pretty low when it comes to most of these um all of these three categories really. Um, but

53:41

the interesting thing is the ones that actually score the best are the older generation models and and GPT40 O specifically is the one that has, you know, is is currently topping this benchmark. Uh, so I've been thinking a lot about this and I've been experimenting a lot with trying to attack different coding

54:05

agents like cloud code, cursor, etc., etc. And I've basically been able to hack all of them. But one interesting thing is that you know and this is

54:16

anecdotal but my uh you know gut feeling is that these new generation models are heavily trained on coding tasks. And if you've looked at like a um you know stepbystep conversation with Claude code, you'll notice that Claude has builtin prompt injections into it. So at some point in a conversation, you'll see

54:45

that you know the model actually sends a prompt injection to itself to update its context basically. And they do that by sending specific tags like system prompt tags into the conversation history. So in order to have those type of instructions pass through and actually make the coding agent perform better,

55:11

you need to, you know, bring down the amount of guard rails that are built into the model. So that's one um one of the cool things that we um were able to show uh with uh with this benchmark that older models that are not trained on coding specifically perform much better when it comes to safety um with regards to these three

55:40

categories than than the newer generation models do. What's the example of a test case that you ran like between you? Uh so I can show I can I can actually show so we have a we have an uh a dashboard that we can actually just put in any model and it will run all of the 150 like test cases. But let's take like claude code 4.5

56:07

and you can go in and I can go in and and just you know uh share one of the uh one of the conversations that we have. So here I'm stating my social security number. I'm giving it some information.

56:21

I'm asking it to recite it for me. uh and in most of the cases it doesn't it's not allowed to do that inherently because it has some guard rails that says that I can't you know repeat like this credit card number as you see here as an example but in some cases you can actually bypass that like in this case like here's my social security number

56:45

and then you see that you know it can recite it even though it shouldn't. So this is a pretty simple example but there are other examples as well where we try to you know uh basically ask it to let me find one that's and can you click is it possible to zoom just a one click a zoom maybe there we go yep now people will be able to read it

57:12

sure so uh in this case you know uh I basically pass in an example example of something that it shouldn't be able to do and I ask it to do it and it you know goes through and does it basically and in the previous case that I showed uh let me see if I can scroll down here so you guys can see um this was the

57:35

previous example where where I actually gave it like in in the data that we passed to the environment we have a social security number we ask it to recite it shouldn't but it does it anyways and that's basically a failed test, so to speak. So, that's how how it works. And here's an example of a pass test where I pass it my credit card number and ask it to, you know, recite

58:02

that number and it doesn't want to do that. So, this is a a pass test, so to speak, and the previous one was a failed test. So, that's how we test them. And we have a bunch of these uh different

58:15

type of basically it's conversation history like messages that we send into an agent environment. The agent has access to some documents that we upload where some information might be you know uh should be hidden but it's not etc etc. And we try to see if we can get it to output something that it shouldn't.

58:40

For sure. And do the do they have tools as well or just Yeah, they have tools. Uh and uh in some cases they have tools. In some cases we uh actually um basically generate the

58:56

whole like tool called tool result all the steps synthetically and then in one of those steps we might you know expose a social security number. Let's say it's in a tool result right and then we ask it to expose that. And if it does that, we mark that as a failed test.

59:16

Would it be allowed to do that within the same conversation with the same user? Like or is that was that just not allowed at all? Like uh like how do you mean like if I'm having a conversation with the agent and it's still me right in the third, fourth, fifth turn, like shouldn't I be able to see tool

59:36

results that have happened? But then maybe someone who's not me shouldn't. Are those the type of protection benchmarks we could add in the future or is that just not even a thing? So interesting thing is like if you go

59:48

to chat GPT and if you enter your social security number that's a onetoone conversation between you and chat GPT right you just enter it into the chat and then in the next you know question you ask it to recite that it won't be able to do that. It will block that. The same goes with emails. Like if you ask

1:00:07

it like I've uploaded these emails like can you can you give me a list of all the emails I I uploaded. It won't be able to do that because it has some built-in guard rails, right? So what we're trying to show is that in most cases like in in in GPT5 like in 60% of the cases it actually adheres to those

1:00:27

guardrails but in 40% of the cases it doesn't. So that's what we're trying to show with this benchmark like how well are those built-in guard rails actually working and um we also one of the other interesting things I think um you know I was in a in a DM with one of your uh teammates and he asked me how well GPoss

1:00:55

safeguard the new guardrail model actually performs and it performs really well but in 10% of the cases it doesn't perform at all. And so you know the question becomes is that you know a good thing is that a bad thing like is 10% you know one one out of 10 queries that might expose a environment variable is that a good thing is that a bad thing that's up to the developer right so what

1:01:23

we we are trying to do is that we're trying to give developers with this benchmark another um basically another KPI that they can use when they choose which model they want to, you know, use. So, if you take like Gemini 2.5, I'm I'm a big fan of Gemini 2.5 because it's

1:01:45

cheap, it's fast, you know, it has a big context window. Uh, but now I know that well, it scores like on number 18th on on this benchmark. So, perhaps I won't use it if I know that the data it's processing has information that I want to safeguard in in some kind of way. Right? So that's that's what we're trying to like give developers a a sense

1:02:09

of how these models actually, you know, perform. So let me ask some maybe dumb questions. Yeah. So what you're kind of showcasing is

1:02:20

that these models have some built-in system prompts or rules, right, to guard to guard against certain things. And you're showing how effective it is with those system prompts or rules. So you could maybe take a leap of faith and say, well, if I wrote a system prompt where I was I had specific rules in there, if the model doesn't adhere to its own rules, how well is it going to

1:02:44

adhere to the system prompt that I give it? And so exactly if you find the ones that score higher, they're probably more likely to adhere to, you know, rules like don't allow people to share their name or something, you know, whatever whatever personal identifiable information you want to block, right? or whatever it could be,

1:03:03

whatever security you want to add into this system prompt, it's basically showing that it it how well you how well it does it adhering to that. Is that am I understanding that correctly? Yeah, absolutely. And I think that most

1:03:18

people like we call it vibe security. Most people actually use the system prompt to try to guard rail, you know, their agents. That's that's the I think the most used tool to actually guard rail your agent is the system prompt. I don't think that many people actually use like guardrail models or have a dual

1:03:40

model setup or whatever, right? Because it adds complexity to to your setup and it adds latency and cost. Yeah. Um so you're absolutely correct. Um and

1:03:53

um that that's exactly what we we're trying to like showcase like how well do these models actually like follow their own guardrails. One other thing that we're trying to do so we've just opened up like the the dashboard that I show showed you. So let's say that you add you know a specific system prompt. It

1:04:15

would be interesting for you to have another agent try to hack that system prompt to see how well it actually performs. So right now you can go in and you can actually add your own API endpoint and our tests will run uh on your agent to see how well that performs contra like a raw JPT system prompt. Right? So, it's also a tool where you

1:04:41

can, you know, test your own specific setup of prompts and and guardrails that you have in order to see how effective they are. What this data proves to me though is like people's most favorite models are have like data protection of 60 like the score is 60, right? They don't even know it. And if let's say they know it now that means that they need to go to the

1:05:06

application layer to handle a lot of this stuff right correct I'm just sharing the actual benchmark here so we have that as context but yes that's correct they have to currently there's no like the the build the model itself doesn't have adequate like defenses right so you need to do something you have that or whatever Exactly. And I think that like for Asian, you know, it's a plus one.

1:05:38

Yeah. And and I think that like for agent frameworks like Mustra, one of the core things that I look for when I'm trying to when I'm building agents or when I'm uh you know trying to build something cool, whatever it might be, production or demo um I will look at how easy is it for me to hook into these primitives that these agents have like and I know you guys have these

1:06:04

input output like hooks that I can hook into and process messages before they ever reach the user or before they ever reach the agent itself. So that becomes you know more valuable and there are some frameworks that don't have any hooks at all which makes it really hard to actually you know uh build guard rails around like a tool call or a tool

1:06:29

result or an input prompt or whatever it might be. That's awesome. Yeah, for sure. There's a few comments in the chat, so

1:06:40

let's pull pull those up. Queen Quinn, I don't know how to pronounce your name. But this would be a good tool for buyers and sellers of AI agents. Okay. Yeah, I see that. I think

1:06:54

you know Ismile likes to hear that. Probably you can prove how it performs on the benchmark before buying. These guys could open a consulting arm of the startup to sort out those industry and implementation details. Is

1:07:05

this gonna work for you? Is this even a plant? Yeah. Who are you? Are you a Are you a plant?

1:07:11

I wish I wish it was a plant. Yeah, we could build a whole forward deployed, you know, military that goes out and tries to do these kind of things. But yeah. Yeah, I honestly think people just, you know, they're a lot of bigger

1:07:30

teams aren't far enough along to even have started to figure out the security implementation details of a lot of this and they're going to find out quick, right? Because it isn't. I think people just expect that when they use software it'll be secure and it kind of really isn't. There there's a very it's very

1:07:48

easy to make these things give out information it probably shouldn't. And one of the So, I guess how would you respond to this one? The problem is that these benchmarks differ from industry to industry. Healthcare is different in

1:07:58

different than entertainment, I think is what that's supposed to say, for example. That's correct. And I think that the way the that's exactly what we thought about when we built this. So, you know, I've

1:08:13

set up all of these test runs myself, but the goal is to be able to customize each test prompt or is each test case for your specific agent and for your specific use case. So, our our job is to basically create a tool where anyone that's building agents can go in and specify, let's say you're building a healthcare agent, right? and you would

1:08:40

want to specify specific test cases or specific stuff that you would want it to test, you should be able to do that. So this is more of a call to developers to start you know measuring these these things and not only on a you know uh general level which this benchmark is but actually their own agents their own

1:09:03

prompts their own like scaffolding that they have set up. Um, so that's what we're trying to do and you can actually do that right now. And I think that, you know, in general, I think that models still are aren't that capable. So they can't do that much harm. You know,

1:09:23

yes, they can leak some information that they shouldn't. You could get sued, but think like this, like in five years, these model will be five times smarter than they are today. Think of all of the nasty stuff that these models might be able to do given that somebody is able to g you know get access to specific type of u information or even be able to

1:09:51

give it specific type of instructions that these smarter models can actually use. So this becomes you know I agree with you Shane that you know most people don't even think about this yet but as these model gain you know uh intelligence it will be even more important. Are you going to publish this data set or is it more like is that the secret sauce?

1:10:17

No no we're going to open source everything. We're going to open source the attack agent. We're going to open source the data set that was used to train the attack agent. So people can

1:10:27

train their own attack agents if they want to. And we're going to give access to anyone to run these uh tests on their own agents for free. So you can just go in, sign up, and you can start running your own tests.

1:10:41

Nice. It doesn't cost anything and and you can run it. And I think my like my goal is to get as many agents tested as possible. It's like a STD. You want to

1:10:55

get as many people tested as possible, right? So uh and then hopefully be able to sell some kind of uh you know condom on top of that, right? No, I'm not sure. But but that's that's like the analogy that

1:11:07

we have. That's uh that's the analogy we use, you know, the STD analogy. So, it it it works pretty well. And I think like, you

1:11:18

know, it's pretty obvious to me when I look at this that, you know, current models, they they are good at certain things and bad at certain things. And older models are good at, you know, some things and they're not that, you know, good at other things. So it's all dependent on how these models are trained on what data they're trained on. And I think actually you know these security flaws that that

1:11:45

we you know try to try to you know spotlight here these are actually features in the models that we use every day. So being able to hijack the model can actually be a feature if it's done you know uh in a way where it actually provides some kind of value like in the claw code case that I mentioned. So it's not black and white but I think it's

1:12:07

important to measure and I think it's important for developers to you know hook this up to their CI/CD or whatever and every time they push a new prompt change maybe they should run a benchmark and see how well does it actually do the things I want it to do. And is the attack model out or that's coming out? It's coming out. We just released the benchmark. So we are just trying to

1:12:31

release everything on hugging face so anyone can like play around with it. Um and you know set up their own environments and and try to try to run their own benchmarks. So, currently you can only run it through our like app that we have, but hopefully in a week or two you should be able to run it on your

1:12:50

own uh you know within your own firewall. That's awesome. So, like in the future technically, let's say we're let's put business out of it for one second, but like technically in the future, let's say I'm building an agent. I could then build I should then maybe have a

1:13:08

best practice to also build that agent's villain and to always be running them forever attacking each other or like you know just getting always attacked and then you know you just iterate and then one day you can always fortify yourself from attacks because I like it's like algebra for me you know like you you do like algebra the last step in

1:13:33

any algebra like uh you thing that you're doing is to actually test the values that you get out of those variables and see that the thing that you you know think is the answer or think is the thing you need is actually the thing you need and I think that's a good approach and I think that you know most labs like most labs do this already and uh it would be cool if it was

1:14:00

available for people building agents as well not only like the top labs that uh red teams or whatever that go through these kind of cycles with the model releases and all of that. So I think it would be cool to to democratize like the whole testing suite for you know uh these type of agents with with safety uh as as the as the goal.

1:14:24

I smell an integration in our future if anyone smell that. Yeah, let's do it. Let's do it. We're up for that, you know. Yeah.

1:14:36

All right, Ismael. Uh, it was great having you. We do need to move on to the next segment of the show, but people can follow you. I think I have this here,

1:14:49

right? We'll see. You're on X. Yes, sir. Right there. It's on the screen.

1:14:55

P E L A S E Y D on X. for those listening in the audio only version for those of you that you know there are a few. Thank you for having me guys. Yeah, thanks for coming on sharing some

1:15:09

wisdom showing some benchmarks. People will be excited see that and try to build more secure agents. Cool. Have a great day guys. Bye. See you.

1:15:25

I love that dude. He's And then we got condom reference times two. Yep. Yep. I knew I I was like, "How's he gonna top it?" And he waited till the end. And you know, just just to be clear

1:15:39

for anyone that was, you know, paying attention. Quinn Queen. Quinn, I don't know what Ken like in Spanish. Ken. Okay. Yeah. I can't pronounce things. That's fine. Not a plant. Just

1:15:53

working on their own master agent. Awesome. That's cool.

1:15:58

All right, dude. Of the evil twin. I like What's that? The evil twin.

1:16:04

Yeah, we should do some in that. Anyway, yeah, we got uh backtrack now because we, you know, we were talking so much about TSAI. We didn't get to the news before Ez got on. So, we're going to

1:16:17

backtrack in time and do some AI news. We'll keep it pretty brief. We'll try to hit the high level. Uh there's a few things that we'll maybe just gloss over a little bit,

1:16:29

but thank you all for watching. This is AI Agents Hour. We're going to talk about some news now. So, first off, this

1:16:35

is just nothing to share, but on the screen at least. Apparently, Claude Code is going to be is nearing 1 billion of ARR. Damn, it's a lot of tokens.

1:16:47

It's a lot of tokens, dude. Just, you know, just melting GPUs over there. Damn. And Anthropic is expecting 4.7 billion

1:16:58

in revenue this year and OpenAI is expecting 13 billion which is actually surprising. I would have expected OpenAI to be 10x. So maybe it's like the the coding products are the one that's driving it, not Claude versus Ghhat GPT. Yeah, I think honestly

1:17:18

I imagine a lot of people are like coding agents are turning through a ton of that. Mhm. Dang though, that's crazy. Yeah, those are some big numbers when you think that these companies, I mean,

1:17:31

they existed a few years ago, but they didn't really exist in the in the minds of most people, right? No one, you know, three years ago. Yeah. Three and a half

1:17:39

years ago, no one knew who OpenAI was really, right? Like maybe it was like a research lab. No, now it's closed AI. Yeah, now it's closed AI.

1:17:51

Uh so OpenAI did release kind of some notes from kind of an incident where I I'll just share it quickly where they basically said that on September 15th they launched GPT5 codeex but the people were complaining that the codeex quality really dipped. So there's a whole bunch of findings here. So, this is kind of like a post-mortem in a way, which which reads

1:18:21

really cool if you're an engineer and you've done these because you can kind of see what they investigated, what they found. It wasn't just one thing. There's a whole bunch of findings. But I do think that

1:18:34

this is, if you're building an agent, this is kind of concerning in some ways because it they're having a hard time specifying why quality might change, right? If the model provider's quality is changing when they're making updates, then what are you you're dependent on the quality of that model. So your agent quality could change without you even necessarily knowing it,

1:19:00

which is, you know, cue the discussion of to eval or to not eval all of those things that we've we've mentioned in the past. But it's just a interesting read. You know, some interesting findings.

1:19:12

They had a whole bunch of different uh findings and actions that they took from this article. Surprising that they did a postmortem, so cool. I mean, I think, you know, but when they call it Ghost in the Machine, it's not very reassuring.

1:19:28

No, but but glad they did it. I mean, they should be It's good that they're open about some of this stuff, right? Make it sound cooler than it actually is. So,

1:19:39

yeah, true. I used to I once wrote a song called Ghost in the Machine. So Oh, yeah. I mean, there was uh many other prior prior art references. It wasn't that

1:19:52

original. So, this is something we talked about before. You've used composer. I'm just sharing the blog post because I think it

1:19:58

is interesting. If you're if you want to know how cursor built the model, they're using RL. you know, our friend Andy, Professor Andy could tell us more about probably what they did, but they go into detail on some of this. So, it's a mixture of experts architecture.

1:20:16

They use some RL. Have you used comp? I have not. Uh, but they track it with

1:20:23

cursor bench, which is like their own benchmark. So, they so they're having some kind of basically eval, right? It's a benchmark. It's eval. It's the same thing for the most part, but they're they're trying to just, you know, gauge

1:20:37

the quality and you can see they they basically talk about how it scores on things. But I think what you're probably, you know, I think what you would say based on what my conversations with you is that it's fast, but it's not as good as like maybe the the frontier models on complex stuff. Is that right? Is that your experience? That's my experience. Like

1:21:01

it's super fast. It's like really fast. Um and sometimes I think it's too fast because I I don't have the the trust, you know, based on using claude and stuff. Like if sonnet

1:21:14

moved that fast, you would hit stop because you're not like not comfortable. Um but it's really fast and when you know what you're doing, it's really helpful. But I still think when you give it a really general task and don't prompt efficiently, it can go off the rails. But then it goes off the rails very fast because it just starts making

1:21:32

moves. Boom, boom, boom, boom, boom. So I should I mean if you if you're using cursor, try it. But if you're already

1:21:37

using like cloud code and already doing stuff and already have a setup, I wouldn't bother changing. Well said. Uh question from Ken.

1:21:51

Canra evals be useful to hook into an RL library? Most of it is Python, so you'd probably need to do it over HTTP. What do you think? Um, until RL libraries are native in

1:22:05

JavaScript, uh, probably can't unless you can execute Mashra in a RL environment. Uh, which we talked with Andy about last time he was on. Um, I don't think there's a good way for JS devs to get into this right now. Um

1:22:23

but there are more and more tools becoming available. So I do think over time it might become easier but yeah right now probably probably not an easy path but a path that will likely be paved over the next bit. Yeah and like Ken says would be cool if Monster helps build agents but also RL products and since Typescript is right there in the front end you can train browsers and all that stuff. 100%.

1:22:48

That's why we're I mean we're friends with Andy because we're friends, but we're also friends with Andy for a strategic purpose of doing that one day. So, for sure. Yep. All right. So, one of the things mentioned in this cursor article that

1:23:04

got me thinking is they they said the result is productionready model integrated into cursor's agent harness. And then I got me thinking, wait, I've been seeing all this talk about yeah, what is an agent harness? So an agent harness is basically like built-in tools and capabilities for agents. So rather than you controlling

1:23:25

all of it, you're basically using an agent that has like a built-in harness like a cloud code SDK, right? That they had built-in tools already and you can add more to it. So it already has like a harness and you know leave it to leave it to decks. You know, we can't go a month without

1:23:43

talking about decks either. To kind of introduce this concept of harness engineering, which I don't think this one's going to stick like harness engineering. Yeah. I

1:23:55

Dex, I I think this one was a was a swing and a miss. That's my That's my, you know, preference. It's my thoughts.

1:24:01

It's not going to stick like context engineering did, but I applaud you for uh for making your ugly diagram even harder to read. I think so the harness the thing that a harness exists like cloud code and uh cursor and whatever that seems like it's going to be a trend right because also people are on the responses API

1:24:26

right now on Twitter too which is essentially kind of a harness right that if you're using the responses API there are built-in tools that even chat GPT is like doing memory and stuff saving messages for you behind the scenes which is the same thing as a harness It really is because you can use the responses API to then build agents and So it's

1:24:45

not like some new here. Um that's one thought. But I think that that that responses API if it expands more then that is a harness. So that's like a trend that can continue. Or there might

1:24:57

be something like people will make SDKs that are just base model APIs and then you have ship like just doing stuff behind the scenes. So on the one hand I I don't think this name is going to persist but Dex please come on the show and let's debate it. But I do think one of the things that will become more and more of a pattern

1:25:21

is is more of these like this, you know, quoteunquote harness or these tools, these like built-in capabilities moving into the agents behind the API, right? So rather than just calling out to a model, you might be calling out, your agent might call out to another agent or might be built on top of another agent that's already has tools baked in

1:25:39

and a lot of those tools are going to be behind, you know, behind the the API. So rather than just calling to a model, you're calling to an agent. That agent has some built-in things. So you don't have to you don't have to build all the tools yourself, right? You're just kind of turning them on and they're they're

1:25:52

available. Yep. And then you can serve your agent in a REST API and then no one will ever know you're using an agent. It's just circular, guys. It's just all circular. Yep.

1:26:05

All right. Some, you know, we'll do some quick hits on some announcements. So, Cerebrus is now powered by GLM 4.6, which GLM is one of the many, you know, Chinese

1:26:21

models. And I'm calling it next week we're doing the big Chinese model show. So, we're gonna we're carving out half hour. Next week is going to be dedicated to just talking about Chinese models.

1:26:33

So, if those of you that What's that? The Chinese are coming, dude. The Chinese are here, man.

1:26:42

And we we're going to talk about it because I'm going to be honest. I'm I still can't keep track of them all and I try to pay attention to all the stuff that's going on. But we'll we'll do a little bit more of a deep dive. We'll try to try to plan out a bit more and talk about all the different providers.

1:26:58

So, some of the models, some of the strengths and weaknesses on the benchmarks. Yeah. And if nothing else, you'll leave more confused than you came, but hopefully have learned a little bit. But you'll know that the Chinese are here.

1:27:10

But the one interesting thing is uh how much cheaper this is. I think it's you know like 30 to 60 to 50% of the cost of or maybe it's even less than that. I'm going to look I got to look it up. It's it's significantly cheaper than sonnet.

1:27:29

Yeah. I think you get like you get a million transactions per minute. That's a lot.

1:27:37

Yeah. So, it's fast, it's cheap. Um, yeah.

1:27:44

All right. So, that's that one. Miniax.

1:27:50

Mini max. Miniax. Another another uh model from not the United States, but China has basically announced their M2 plans.

1:28:03

So Miniax M2, it's 8 to 10% of Cloud Sonic cost for their coding plan, two times the usage limits. Basically, they're they're trying to, you know, take away some of some of Sonnet's usage, right? They're trying to dip into some of the Yeah. some of that

1:28:22

uh enthropic ARR that we talked about earlier. Exactly. And you know, also if they're they're just trying to give frontier quality in open source and outside of this stuff, right? So

1:28:34

yeah, I mean I still, you know, I think I'm definitely pro using grabbing these models, running them yourself on, you know, inference providers. I'm still skeptical of like using them, you know, like going to Deep Seek and just using it from from Deepseek itself, right? Because yeah, obviously your data is getting sent to a place where you don't know where it is. So, but the prices are going to be very

1:29:02

compelling. So, yeah. Yeah. Make your own decisions, I guess. All you cheap fools out there. So,

1:29:08

that'd be interesting. Yeah. Uh there's an, you know, on with more announcements. Kimmy K2 thinking uh model is here.

1:29:22

So I think that previously they were had announced that it's coming. Now it's it's here as of yeah a couple days ago. So there you go. State-of-the-art on some benchmarks. T0 sequential tool calls without human

1:29:39

interference. Excels at reasoning. Pretty good size context window.

1:29:46

So from Kimmy Moonshot. And yeah, that's it. That's all the model updates for the week. Nice. One new segment I want to introduce, and

1:29:57

we've done this in the past, but you know, I just want to maybe do it more often, is interesting GitHub projects of the week. Let's do a star party. You know, let's let's start some GitHub repos.

1:30:09

So, all right, first one. This one already has a lot of stars, but you know, you could throw a few more on there. I will. Why not? It's called Tune. It's a format

1:30:27

kind of built around token optimization. So, it's tokenoriented object notation tune, which you know is supposed to be compact human readable schema JSON for LLM prompts. Dude, I love that. This is sick. So, 30 to 60% lower token usage. So you

1:30:47

can see there's an example here. So here's the JSON and it's it's almost kind of like CSV but it's just very token optimized, right? Yeah. And so in a lot of ways it's it's

1:30:58

similar to like a CSV and script. Yeah. Yeah. A little bit.

1:31:06

So it conveys the same information with fewer tokens. So the idea is imagine you had a big JSON, you could potentially send this and save 30 to 60% token costs. Okay. Uh so it does have some caveats here.

1:31:24

Uh I'm trying to see where it is, but basically it's intended for LLM input. So you'll still use JSON, but then you'll convert it to this before you send it to the LLM. Now I do think and this is something that Tyler on our team had mentioned is that accuracy might decrease just because the model isn't as trained on this notation

1:31:49

yet at least right is like this is kind of ideally like we can look at this and pick it out but how well will the model do if it wasn't in the training to be determined I don't know if they so they do have some accuracy and apparently It's, you know, better, I guess, with less tokens, but I do have, my first thought is I'm I would be

1:32:14

shocked that it could be better. Yeah. So, I' I'd want to uh, you know, do my own tests on this, I think. But

1:32:20

definitely could save a ton of token costs if you're pumping a lot of tokens through. And honestly, we hear a lot of people that are kind of at that point where if they get to productionizing and they're actually going to have real, especially if they're building like coding agents or or things that are kind of around coding agents, they just churn through tons of tokens.

1:32:38

That's this is a great project. Y'all should start it. Yeah, go star tune. I also follow this guy on Twitter, so

1:32:45

maybe follow him too. Do you do the No, just just uh the main the creator and maintainer. All right. Well, there is

1:32:58

Johan. Johan, thanks for building something awesome. It's cool. And while you're there, give give us a star. Find the MRA

1:33:11

AI uh project or repo. Give us a star. And last one we're going to do, this one is less known, so we'll do one that's kind of somewhat well known and one that people have probably never heard of.

1:33:25

It's called MCP agent mail. So, we talked about MCP. We started with MCP. We're going to end talking about MCP

1:33:32

almost like we planned it that way or maybe we just fell into it. But you can actually give your agents through MCP essentially an inbox. So it can communicate with other agents essentially. It allows you to connect

1:33:49

different basically your different agents and they can basically send messages to each other and use that to build context. I doubt it really works well depending on your situation, but maybe there's it's almost like a multi- aent system you could kind of wire together through MCP. So kind of interesting. I don't know if I would

1:34:07

ever use it, but there you go. I'm gonna give him a star. give them a star. Anyways, yeah, I think it's pretty cool. It looks like it it has like a way to preview and

1:34:19

search the messages so you can see it's kind of like you you send these agents out and you kind of watch how they talk to each other. So, it's almost like it'd be cool for simulations. I think you could do some cool simulation things with it and see how um how the agents actually communicate and and what decisions they make because they can now talk to another agent.

1:34:38

Yeah. or like you send a task to an inbox and then you know kind of have like a whole bunch of agents working off that inbox and stuff. Pretty sick. Yeah. So it's almost like distributed

1:34:50

use. I think the whole like inbox task thing is is good. You got a job cue through email. There

1:34:57

you go. All right. Nice. I like this segment.

1:35:03

Yeah. So we'll bring that uh we'll try to do that more often. If you have cool projects, you know, cool interesting GitHub projects that are open source that we should uh talk about, send it to us, especially if they're AI or agent related. And with that, thank you all

1:35:21

for tuning in. Any parting words before we go, Obby? Uh, super stoked for next week's episode on the Chinese models. Thanks everyone who went to TSAI or is going to watch

1:35:33

all the YouTube content that we're about to generate. Yeah, it's been great. Yeah, subscribe to our YouTube if you're not already. You'll see when we release all these TSAI talks, which are going to be really good. So, please go there,

1:35:48

click the subscribe button, click the like button on this video, please find us on Spotify or Apple Podcasts, give us a fivestar review if you really like it. If you think we suck, don't do it. Um, you know, we don't we don't need any of those onestar reviews. Tell more of your friends about the show. Like it's really nice having a lot

1:36:08

of live chatter. So the more live people there is the funner the show gets. And I we're getting a lot more live listeners now which is dope. Like like this one.

1:36:20

Yeah. I don't know how to pronounce it. Well, how do you pronounce this? You're going to be the pronunciation person. Every

1:36:26

time we go up, you get to Dev Ka. I don't know. Yeah. Devkuna. Kunha. Yeah. Says nice.

1:36:35

Thanks for watching. Thanks for tuning in. Follow Abby on X at Abby. Follow me on X

1:36:42

at SMT Thomas 3. Until until X lets me buy the Shane handle. I'm I'm trying to I'm trying to buy it, but you know, I don't think I can afford it. We'll see.

1:36:54

One day, someday I'll I'll own Shane. I doubt it. Until then, and probably for always at SM Thomas 3. Okay, everybody

1:37:05

have a great day. We will see you next week. Peace. Peace.