To Eval or Not to Eval, Massive JavaScript vulnerability, Corbin from Artifact, AI News
There was a lot of hate for Evals last week... so should you be using Evals in your AI project? We talk about a massive JavaScript supply chain attack that impacted 18 core NPM packages. We chat with Corbin from Artifact to learn how AI is being used to build an IDE for electrical engineers. Finally, we cover all the other AI news with discussions around AI safety, parenting in the age of AI, AI legal settlements, some stealth models and much more.
Guests in this episode

Corbin Klett
Artifact
Episode Transcript
Hey everyone and welcome to AI Agents Hour. I'm Shane. I'm here with Obby as always. It's Monday and, dude, I was
worried that there wasn't a lot to talk about today compared to other weeks, and then right before the show we kind of ran into something. Yeah, dude, we have material now, which is good. You just got to keep your eyes open an hour before the show and something always seems to happen. Yeah. But before we dive into that, how was your weekend,
dude? It was good, dude. I'm currently in New York City visiting a bunch of Mastra users and friends and stuff like that. And it's pretty tight. Currently in
a co-working space with one of our guests that might be coming on later. So, yeah. Awesome. Yeah. I know at one point in your life you were considering moving to New York, and then fate brought
you to SF. Yeah. Well, YC, I guess, was the fate there.
Um, it was actually interesting. I was walking around reminiscing, because before YC I was scoping it out here and just trying to figure the city out. And now I know the city pretty well. You know, people invite me somewhere, I'm like, "Okay, I'm just going to take the subway, get
there really quick." But in SF, it's completely different. So, way different city vibes, and the startup culture here is a lot different, too, which is interesting.
Any big takeaways? People are still talking about GPT-5 here. That's how different things are around here. And I think news probably travels a little slower. So, the things that I heard people talking
about are things that we've already talked about on the show and in person in SF. So I think AI really does live in San Francisco. I can really see it now. Yeah. And I do imagine New York would be considered a fast follower,
right? It's got to be up to date more rapidly than, say, the Midwest, which is where I'm at right now. Right. So, yeah, someone was saying that the news travels
two weeks behind in terms of the news circuit, or people don't care until maybe it gets a little more penetration. But for us, right, when we see the headline, we are automatically on it. Yeah. So I guess what we talk about today, New York's going to be talking about next week or the week after. So,
maybe. Also, people here party, dude, in the sense that a lot of people are working many long hours, like 996 or whatever, and then a lot of people still have time to socialize after 9:00 because the city doesn't sleep. So people are hustling and having fun every day. I'd probably get
burnt out if I did that, but I'm down to try it for a week. Be careful, my friend. Yeah, I will try. Too much fun is not good. Uh oh, dude.
Yeah. Well, maybe with that we can talk a little bit about the vulnerability. So this kind of just came out, so I only know a very small amount, but essentially someone's npm account was compromised, maybe through a phishing attack. He or she, I don't exactly know, lost access to their account, and 18 core npm packages
were hacked, including big-name packages like chalk and debug, which got basically malicious releases, right? Yeah. Even the debug module, which a lot of people use, and strip-ansi and chalk. Any CLI tool you're using has those as dependencies. So it's really bad, if
you ask me. Yeah, it sounds like it has been fixed now, or a fix is going out, or what's the status? So it's interesting. A fix is definitely going to be going out, but we have Dependabot on all of our packages and it is not aware of this yet, because this is so new. But I'm sure Dependabot will be having a
vulnerability fix and people can just merge it, but for us we might need to do a release today even though release day is tomorrow. So once we get the fix, we'll have to do a hotfix to get people off these vulnerable versions. It's kind of a mess. Something like this happened many years ago, and we
probably date ourselves when we bring this up, but back in the day left-pad was a module that many, many packages depended on, and I think they just deleted the module and then everything broke. The whole JS ecosystem broke. This is
not like that, but it's pretty bad, because now the whole JS ecosystem is vulnerable if you have an attack vector like this. So they're going to talk about this for a while. Yeah. And it's one of those things where essentially
what they tried to do is get this into crypto wallets that are built in TypeScript and then, if you're using those wallets, replace addresses with their crypto address. So essentially, if you're not building a crypto application, maybe you'd be okay. But still, it's obviously a security issue that needs to get patched and we need to get
upgraded. But if you are using a crypto wallet, you probably shouldn't use it. I basically heard if you're using a software crypto wallet, just don't use it until this thing's fixed. But then all the
crypto companies backed by these TS libraries, man. There haven't been any horror stories reported yet, but where there's a will there's a way with attackers. Yeah. And it sounds like someone's been monitoring the accounts and no money's been moved yet, but who knows? There are a lot of
things in flight and this just came out. I just saw it less than an hour ago. Yeah, I think it just got reported at 12:48. I don't know what time zone that
is because I don't know what time zone I'm in, but yeah, it's very new. It was in the last few hours. Yeah. And all the modules are publishing new versions. I guess this is the thing that Python
people also have to worry about. But since our stuff is built on the whole entire web, we're exposed when something like this happens. Yeah, man. Definitely. If you do have TypeScript, or if you're using npm, plan on doing some
updates, because there's a decent chance some of your projects are vulnerable. I don't know how long the vulnerability has existed, that's not clear, so if you upgraded during a certain window you'd be vulnerable, and you're obviously going to need to upgrade soon to fix this.
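A quick aside for anyone listening who wants something actionable: one first pass is to scan your lockfile for the reported releases. Here's a rough sketch in TypeScript; the package list and version numbers are illustrative placeholders, not a verified advisory list, so swap in the versions from the actual advisory before relying on it.

```ts
// check-lockfile.ts -- a rough sketch, not an official remediation tool.
// Scans an npm v2/v3 package-lock.json for packages flagged in the advisory.
import { readFileSync } from "node:fs";

// Illustrative only: fill this in from the actual advisory.
const compromised: Record<string, string[]> = {
  chalk: ["5.6.1"],
  debug: ["4.4.2"],
  "strip-ansi": ["7.1.1"],
};

const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));

// Modern lockfiles keep a flat "packages" map keyed by node_modules path.
for (const [path, meta] of Object.entries<{ version?: string }>(lock.packages ?? {})) {
  const name = path.split("node_modules/").pop();
  if (name && meta.version && compromised[name]?.includes(meta.version)) {
    console.log(`vulnerable: ${name}@${meta.version} (${path})`);
  }
}
```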
Yeah, we have a comment from the audience: waiting for the Fireship video on this. Dude, me too. I can't wait to see the explainers and everything. But the security researcher,
I think his name is Charles Guillemet, shared a Substack post from somebody else that actually goes deeper into this. If anyone's interested, I'll post it in our chat here. Y'all can take a look. It's a really good explanation of things. It's kind of crazy how this works
sometimes. Yeah. Anything else we should talk about any any more on the the vulnerability?
Well, you know, this was good drama, but there's more drama to talk about, so I think we should move on to that. Yeah, the drama never ends. And in this case, if you've been paying attention to, we'll call it AI X, the AI people on X, over the last week, you would have seen a lot of drama around evals. Yeah, you can't predict when this stuff's going to happen, but we're going to be
here to talk about it. You can predict that. And yeah, let's talk through some of what happened and our reactions. And then if you are watching this live, first of
all, please give us a five-star review on Spotify or Apple Podcasts. We appreciate five stars only. Yes. Only if you're going to do five stars. Otherwise, you know, find something else to do, please. Yeah.
And if you are watching this live, either on YouTube or on X or on LinkedIn, please drop us a message in the chat and tell us what you think, because this is now a very hotly contested topic: should you eval? To eval or not to eval, that's the question. Yeah. And there's a lot of big hitters talking, too. So we should go
through what everyone else says and then we can give our own take, I guess. Yeah. So I think this was one of the big posts that kind of got it started. Yep. So, Lenny, you know, big name.
I guess what? Lenny's podcast, right? That's the big Yeah. The one where they have a campfire or something.
Yeah. Lenny's podcast. Lenny's newsletter. So he had a post: trend I'm following, evals are becoming a must-have
skill for product builders and AI companies. It's the first new hard skill in a long time that PMs, engineers, and founders have had to learn to be successful. The last one was maybe SQL or Excel. And then he gives a few examples, quotes from other prominent individuals.
Garry Tan says evals are emerging as a real moat for AI startups. Just, sorry to interrupt, but the Japanese people believe that, too. If you're watching from Japan, I heard y'all when
I was there. Yeah. And honestly, I remember when we first went into YC, some of the group partners were saying that the companies that were having success had spent tons of time on their eval data sets, really just running evals on their AI products. So
there was a lot of push from YC in general that evals are very important. And then "writing evals is going to become a core skill for product managers" from Kevin Weil, the CPO of OpenAI. Mike Krieger from Anthropic: "writing evals is probably the most important thing right now." Previously of Instagram, right? Co-founder of Instagram.
We have Sarah Guo, is that how you pronounce that, Guo? Obviously very popular: "evals are your new marketing." And then of course you have Greg Brockman saying "evals are surprisingly often all you need." So a lot of pro-evals takes right there. Yeah. And those are the pro-evals folks.
And then here comes the hate, right? And this is, you know, as expected. There's more. So then swyx responds: Claude Code, no evals. Well-known code
agent company, no evals. Well-known code agent company two, kind of half-assed evals. Leading vibe coding company, no evals. CEO of company selling you evals:
"Yes, all my top customers do evals." VCs in love with CEOs of eval companies: "Yes, all my top founders do evals. You must do evals." So,
which is kind of hilarious. I thought that was pretty funny because there is truth to it, and yeah, I'll save my thoughts for the end. Let's keep going through. Then there's a whole collection of additional
people here that are chiming in. And so we'll go through some of these, in no particular order. So Alex Reibman from AgentOps, yep, said, "Evals are a scam and we're being gaslit into believing
they aren't." And then of course he has a post where essentially the TL;DR is right there in the headline: evals are a scam, and he explains why. I did think that it was funny that
he listed all the agent or LLM ops startups and there are just so many of them. It is true. There are a ton of them. A lot of them are YC companies, too. Yeah, I don't disagree with that. I
think that is because, you know, there are reasons for that, I believe, but it is definitely an overcrowded market. And then Justin from Helicone. Is that how you pronounce that? I don't
know. Pronunciation's hard. Helicone. Helison. Helony.
Yeah. Yeah. Helicone. I don't know. But they're a YC company. He's a co-founder.
Literally: "Evals. We wasted so much time at Helicone building evals because customers wanted it. But this is a
classic example of the mom test. We needed to distinguish the core customer problem from what they were asking for. Customers want better products. Evals are like cleaning a kitchen with a toothbrush. Let the model providers focus on evals.
Your focus using the model should be building a great product." Okay. So another anti-evals post. You know, it wouldn't be an episode without throwing in a comment
from Dex. Why not? "Did not have on my 2025 bingo card."
Well, Dex needs to get a better bingo card. But then you had Sherwood, who was on the show just a couple weeks ago, right? "Big Eval. It was always a lie." Which is
funny because I feel like Sherwood and you probably talked, because you and I have talked about Big Eval. I mean, you've been saying Big Eval forever. Yeah. Well, Big Eval started when we
were at an observability conference and I was like, Big Eval doesn't want you to know. Yeah. And you can even see that was the day. Yeah. So Sherwood, that was the exact day. You probably talked about it at the
conference with Sherwood. He posted "big eval" June 25th. And now other people are catching on. Yeah. And then, oh yeah, put it right there. "Big Eval
doesn't want you to know." That's me. And then Julia from Quotient AI, formerly GitHub Copilot, has a good post on why evals haven't landed yet, and lessons from building them for Copilot. This is a great article. So we can kind
of go through some of the highlights, but most evals don't land with engineers. I guess the TL;DR is engineers are used to writing tests, but evals are not the same thing necessarily, right? They're similar, but not the same. Yeah, I think in
this article they were pushing that data scientists are used to doing this type of eval, but product engineers are more used to deterministic tests, end-to-end tests, etc. And they did a bunch of different types of evals for Copilot, but even this post was contentious, too. There are people going back and forth trying to clarify
different points and stuff. So on the one hand, GitHub Copilot clearly is using evals, but it's still questionable how useful they were. Were they needed or not? So I recommend reading that post if you want to learn another side of the story, one that I would say is a little bit more pro-evals.
Yeah, so that's the drama, I guess. So much drama. Now, what's our take? Big Eval. That's my take. No, I'm
just kidding. But it is kind of my take. Ever since I was thinking about the Big Eval thing, we were also working on what we call scorers in Mastra, and we purposely didn't call them evals, and maybe this is foresight on our part, because eval is such an overloaded term, just like agents, just like everything in AI engineering is so overloaded.
And for us it was like, okay, what are we actually doing? We're just trying to score the response, and that's it. It doesn't tell you anything other than the score, right? You as a person have to then analyze, do error analysis, etc. And then you can make changes to your
product. But I think the problem with Big Eval and a lot of these big companies is they do auto-evals, things that are not necessarily custom to your product, and then they say that you have to do them, which is essentially just trying to sell their product to you, right? But
what's your take? So my take is, again, I agree with your statement that evals are overloaded, because when you think of evals there are really, in my opinion, two different types, and even those two terms are still going to be overloaded. But first you have evals on what I would traditionally consider
your CI pipeline. I call these testing evals. The goal is to prevent you from messing up production. You don't want to ship something that's going to be a clear regression. So you have some kind
of data set. Maybe you run some evals. They should be custom to your application, like the things that your users actually care about. I've never
been a big believer in off-the-shelf evals. Some people like them, but I think for the most part you don't need them. Maybe they're instructive. You can learn from them and then build your own
off of them. That's kind of our model in Mastra: you can just clone one and then add your own business logic so it actually aligns with your reality. So there are the testing evals, but then there's also what I would consider monitoring evals. That's actual data coming into production that
you want to, maybe sampled, sanity check, to make sure that you don't see something your users are doing that's either malicious or a use case you didn't anticipate that throws your application off. You almost want to use it to find
those random interesting events, or see patterns that you could then pull back into your prompts, into your application. And so I think just defining those, separating those out, matters. Everyone says evals like it's one thing, but that's a really big surface area. Yeah. Like, what part of evals do you hate? Do you hate writing tests, the testing part, or do you hate
setting up observability and monitoring tools? Because evals play into both of those quite a bit. I'm not sure where exactly people hate evals, right?
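To make the two halves concrete, here's a minimal sketch of a testing eval written as an ordinary Vitest suite. Everything named here is hypothetical: myAgent stands in for your agent, scoreRelevancy for your custom scorer, and the 0.7 threshold is arbitrary. The point is just a small, app-specific dataset acting as a regression gate on PRs.

```ts
// Testing evals as a CI gate -- a sketch, not any framework's actual API.
import { describe, it, expect } from "vitest";
import { myAgent } from "./agent";          // hypothetical: the agent under test
import { scoreRelevancy } from "./scorers"; // hypothetical: your custom scorer

// Small, app-specific dataset: the things your users actually care about.
const dataset = [
  { input: "How do I reset my password?", expected: "reset link" },
  { input: "What plans do you offer?", expected: "pricing tiers" },
];

describe("testing evals (run on PRs that touch prompts)", () => {
  for (const row of dataset) {
    it(`scores acceptably on: ${row.input}`, async () => {
      const output = await myAgent.generate(row.input);
      const score = await scoreRelevancy(row.input, output, row.expected);
      expect(score).toBeGreaterThanOrEqual(0.7); // regression gate; tune the threshold
    });
  }
});
```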
Because from our perspective, and I'll actually use Greptile as an example, since Daksh was posting and he's very pro-evals: what they do is they have a data set. For anyone watching, Greptile is a code review bot, right? They have essentially a data set of PRs where it successfully reviewed the code, and as they do new versions of Greptile they test it on all those PRs to make sure they have no regressions. And then if they're trying new models,
they have a certain data set to test that Greptile is as functional as it was before. Once again, to not break anything for their users. So why do people hate evals? I think it's actually useful, but I wonder what kind of evals people are hating.
I mean, I think that's a loaded question. I think a lot of people hate the general factuality and hallucination evals that are just off the shelf. If you just go into a project and add a bunch of evals automatically, and they're not really tied to what your users actually see or your data, then yeah, there's limited
usefulness in those things. So I think that gets some hate. I think the other thing is it's very similar to test-driven development. People either hate it or love it, and
it's going to piss some people off on either side. I think evals are the same thing, especially when you think about this idea of building a perfect golden data set that you run as a huge suite of tests on every PR that modifies the prompt. Some people are
think, and I agree with some of these people, that that's overkill, right? You're spending so much time building this perfect data set and you don't even know if your product is going to work yet, or if people are going to use it. And so I think, and we've seen this from a lot of YC startups, unless they're in some
kind of accounting or law or something where factuality is incredibly important, they just typically skip evals early on. They just get something out there, make sure it resonates with users, and then when they realize that the quality isn't good enough, which is almost always the case, right? It's never quite good enough. Then you
go back and you start adding these other things in. Maybe you start monitoring production data. Then maybe you use some of that in a data set. So you say, okay, we fixed this; let's make sure that going forward this type of situation
works well in the future. So I feel like it's not an either-or. It can be a start small and then build. Yeah. But it does, I think, depend quite a bit on the use case. You know, one thing:
the verticalized agent startups, right, much like our guest and many others, they're first focusing on building their product. There are no users to care about evals yet, right? Then maybe you have some users, and then you start breaking things, much like software development. And what do we
do when we break stuff? We write a test. If you truly broke something for your users, you're probably going to write some semblance of an eval, whether you call it an eval or not. You're just going to make sure you never break it again. Yeah, at least in that way. But once you're done with product
development, you need to make your product better. And I think a lot of people are complaining about data sets, annotations, all the scientific method that data science people are just so used to operating with, right? So that's where, if you have the industry knowledge for your product, the labeling and stuff should actually be chill,
because you have most of the context. I do think a lot of software teams are not building in their area of knowledge. Like, a dev working in healthcare is not a doctor, right? So maybe it's more cumbersome to do evals because they don't know whether something's
right or wrong. But it also makes sense why a lot of the coding agents maybe don't need evals, because they're building the product for themselves. It's like, yeah, does this feel good? I can use it, this feels right, okay, let's ship it. So, do you think that's a bias, right? Like, Claude Code is so biased in
terms of the user base that for them to say no evals and just vibes, yeah, it makes sense, because we're all devs using this. Yeah, they're shipping for themselves. Yeah, they're building products for themselves. And there are probably some deeper truths there:
building a product for yourself is obviously a little easier, to figure out if the product feels right. Yeah, for sure. Yeah, we've got to monitor where people are really hating on evals, because for Mastra we don't want you to use off-the-shelf evals. We want you to write scorers. We have a shadcn type of scorer insert
now, because people liked answer relevancy, but they didn't want to use it as-is. They liked how it looked and they wanted to edit it to do their own thing, and that's what they should be doing. You should be writing stuff that's custom to your application, something like the sketch below.
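For instance, something in the spirit of this sketch, where the rules are plain domain logic rather than a generic metric. The rules themselves are invented for illustration; yours would encode whatever your product actually has to get right.

```ts
// A custom scorer sketch: your business rules, not an off-the-shelf metric.
interface ScoreResult {
  score: number; // 0..1
  reasons: string[];
}

export function scoreSupportReply(output: string): ScoreResult {
  const reasons: string[] = [];
  let score = 1;

  // Hypothetical rule: replies must cite a ticket id like TKT-12345.
  if (!/TKT-\d{5}/.test(output)) {
    score -= 0.5;
    reasons.push("missing ticket reference");
  }
  // Hypothetical rule: an automated reply should never flatly promise a refund.
  if (/\brefund\b/i.test(output) && !/may be eligible/i.test(output)) {
    score -= 0.5;
    reasons.push("unhedged refund promise");
  }

  return { score: Math.max(0, score), reasons };
}
```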
Yeah, adding your own context, pulling in the right data to make sure that it can actually judge if it's correct, or correct enough, for your use case. But yeah, I see a lot of similarities between test-driven development and people hating
on evals, just like they hated on test-driven development. But I also think that if you start small, you're going to add this stuff anyway. Any mature product adds tests. Any mature product is going to add some type of eval over time, at
least on PRs that change the AI parts of your application. And then eventually you're going to add evals on the back side, just like you add monitoring for latency and all these other things. It's just another metric that you're going to have to care about. It's just a matter of when and how far along
you can get before you have to care about it. I think that's kind of the big thing. Yeah, I'm putting my bet on the vertical AI companies as the ones that will actually get the most out of evals. That's my bet.
Yeah, absolutely agree. And I do believe that the bigger your company is, the more likely you're going to spend more time with evals early, because screwing up has more risk for you, right? Yeah. Now, if you can get away with like 80% accuracy, like a code agent,
if a code agent doesn't work well, you just kind of fix it yourself, right? So of course Claude Code maybe doesn't need to be 100% all the time; the expectation is not that it's always going to be 100%. But if I'm asking a tax bot a tax question, I do expect that to be 99.99999%, because otherwise there's probably some liability there. This is kind of a good segue, but if you're
designing hardware products and stuff that will then go and ship into airplanes and cars, are you going to be okay with just 50% accuracy? You're going to kill someone. Yeah. So, that's a great segue. And with that, let's bring in our guest.
So, we have Corbin from Artifact. Hello. Welcome, Corbin.
How are you guys? Thank you for being here. Yeah, good to see you, dude. Yeah, you too. Yeah, it's been a while. You know,
take care of Obby this week. Don't go too wild. We've been taking great care of him.
Yeah, we've been hanging out. I think uh Corbin's in one of these booths right next to me. Actually, we're in the same location. So,
yeah. Well, this is probably a good segue. You want to give us a quick TL;DR? Corbin, who are you?
Tell us just a little bit about Artifact. Sure. Yeah. I started Artifact
with my co-founder Anthony at the beginning of this year. We were in the same YC batch with you guys, and Artifact is software for engineers. It's a design and documentation platform for electrical engineers. So
engineers are missing a lot of the types of tools that software people have, things like version control, collaboration, and AI features, AI copilots. So we're kind of putting that all in an environment that makes it much easier, much faster for electrical engineers to design complex systems. That's awesome. How did you guys come up
with this? Were you working at an old job and thought, man, let's build something like this? What's the origin story? Yeah. I was thinking
that that conversation about evals was an interesting case for us, because engineering is all about solving the right problem at the right level of fidelity and traversing a hierarchy of requirements in order to solve the right problem. And for evals and a vertical company like ours, I kind of
see, as we go up to higher and higher level requirements, as in broader requirements, like, hey, design me an electrical system for an aircraft, I see evals becoming more and more important, because they help you scope the problem in a way that makes it more likely to be correct down the line when you are doing the detailed work. And that's kind of the first thing that got me interested in starting Artifact:
working on engineering problems using, this was GPT-3.5 or something like that back in the day, when I was using it to study engineering problems while working at an aircraft company, at various levels of fidelity, looking at higher-level design and navigation algorithms and asking GPT-3.5 to help me construct those problems, to
help me solve those problems. And I realized, if you guide it well, it's actually really good. And this stuff is probably obvious to everyone now, but a couple years ago it was all pretty new in the hardware engineering world. And
that got me really excited about figuring out, all right, how do we build a platform to bring in these copilots and help engineers solve problems much more quickly and much more reliably. That's awesome. And you got into YC with us. Where were you before? I was working at a startup
called Hermeus. We were working on hypersonic jet prototypes. My co-founder and I both came from that company. We
worked together on avionics systems, simulation development, a flight simulator, and control systems for an airframe and an engine program. So, completely not software, really, right? Or at least not the software that Shane and I are accustomed to. Yeah. Totally different. I'm new
to TypeScript. I'm new to a lot of this stuff. At my old job, in a previous life, we were doing simulation code in Julia and Python, and a lot of embedded stuff in C++. And so I thought,
oh, I'll figure this software thing out. I got this. And then I started developing web apps and I'm like, oh man, this is way more open-ended, way harder in some ways than some of the embedded stuff we were doing before.
You're the first person I've ever heard say that. It makes me feel a little better about my skill set, but I think you're probably exaggerating a bit. I appreciate it, though. So what is Artifact built in? What does
it help the user do? It helps the user draw schematics, and really it's about defining a single source of truth for a complex system. If you're designing an electrical system for a power system, an engine, an airplane, a robot, a spacecraft, a lot of times, especially at these
smaller and medium-sized companies, a lot of startups in the sub-10,000-people range, you're usually using kind of generic tools: Microsoft Excel, Visio, draw.io, almost back-of-the-napkin or whiteboard types of tools, to design systems. Which means that you end up with many sources of truth. You might have a spreadsheet describing part of your system, a Python script
describing another part, probably several diagrams that are just kind of a mess, saved as, you know, v2-final-actually-final, that type of thing. And so we have a system where you can draw the system and all the outputs are computed from that. So it's one source of truth.
Yeah. Do you have a demo we could see or walk through? Pull up some stuff. We've also been dipping our toes in the Mastra
ecosystem this week and adding some exciting new features. Yeah. I'd love to see even just a demo of the product. I think it would really help people who are listening to put it all together, because a
lot of people listening are builders, right? They're building some kind of application. They're probably using some AI or they wouldn't be watching this, whether it's a copilot or some AI features added to their application. So I think
it's always nice to see what others are building and how people think about this. Yeah. Can you see my screen?
Yeah. Yeah. Okay, cool. Yeah. This is our app. This is Artifact. And what I've got
pulled up here is a drawing. I was actually just on a tab that shows almost a git-style interface for version control of the diagram. I've got a project here that has several parts of the system. And basically what users are doing is creating drawings for a couple
of purposes. One, there's a drawing like this which connects a motor controller to a power system and a computer. And I'm really trying to do two things here. One is define the system in a way that's easy to communicate to colleagues. And
the second thing is defining the harnesses. So all these connections between, in this case, we have a bulkhead with mated connectors connected to a motor controller over here, and just manufacturing these cables, mating connectors, and connecting everything together is a pretty tedious task that's pretty error-prone. So we also provide this
workspace to document specifically how everything hooks together. So this is the type of drawing that a lot of our customers are making in Artifact for a variety of avionics types of systems. I love how it gives me that VS Code vibe. I think you guys are marketing yourselves as an IDE,
right? Yeah. We hope to move in that direction, and there's
a lot of smarts built in. So you can output your detailed pin tables, and click on things and see the connections highlighted. All this stuff is very high value for people who are sitting at the workshop manufacturing these parts. But yeah, we wanted to kind
of bring that software IDE feel into the hardware world. And you added version control, too. Today, because it's so disparate with Excel and everything, there's no true source of truth, but now you have git-style version control here.
Yeah. Tracking revisions is really important for PLM and managing engineering component life cycles,
because a lot of the documentation is easy to lose track of, especially if you are creating a new variant of a product and you want to show the heritage of where the drawings come from, show ancestry, and a lot of those tools just don't really exist, at least for electrical systems, and for lots of types of systems you would develop schematics for, like P&IDs, piping and
instrumentation diagrams for fluidic systems. So yeah, we're starting to create an environment where people can collaborate, create drafts, save snapshots, release drawings. And I think we're also going to add kind of a commenting and approval process, so everyone, again, it
comes back to looking at the single source of truth of your system definition. Very cool. Can you talk with these diagrams, like through a copilot or something? We're adding that in. That's actually kind of the current
project. We started adding a RAG search: if I want to add a device, I have a library of components, but Artifact also has a public library of components, so all the cables and connectors you might use in an aerospace environment, for example, we've got those all built in. So step one is we hooked
things up to a RAG search, so you can give a less structured query and find the component you might be looking for. And then we're going to expand this interface. I've also got a couple tools in development that will add items to the diagram for you and connect them. Actually, one cool tool that is not quite camera-
ready, but just about, will take different devices, take in the context of your diagram, and then make connections for you. So you're not having to spend time being clerical and drawing all the connections yourself. That's really cool. Dude, what if you could tab-complete the connections? Well, that's what we want. Yeah, we
really want a smart model that will, like tab-to-complete, but for a variety of drawing types of operations. Yeah, Obby, you just had to find a way to get tab-complete-everything into every episode of this. So, yeah. Yeah. Well, I mean, when Cedar
came on, I was like, dude, I want to tab-complete everything, which I believe they can do. So I don't think Corbin's far away from having tab completion on diagrams. There's a lot of copy-paste that can go on with building these types of diagrams, where you might have a data acquisition system
connecting to a thousand sensors, and you want a copilot or a tab-to-complete type of aid to help you make all those connections. You don't want to have to draw that yourself.
Currently there is someone in a documentation process drawing all those things. So I think that'd be a good candidate for that type of feature as well. How's the user feedback been? You guys have some customers playing with this and actually using it, right? Yeah. Yeah. We've got people who've made
parts for airplanes using this. So it's been super exciting to work with them. The feedback's good. Yeah. Like,
they tell us stuff that they're annoyed with all the time, and we fix things for them right away. But the feedback's been positive. In working with customers, we really look for people who are excited about building the engineering tools of
the future. And so it's a lot of productive back and forth with them. They're pretty exciting to work with because they are doing some pretty sick stuff. That's awesome. So Shane and I talk a
lot about, and I think we were talking about this in the evals segment too, you guys are considered a vertical startup, and our YC batch was like the batch of vertical startups. What advantages do you think being vertical gives you compared to other kinds of AI products out there? Yeah, I was
thinking about that even when you guys were talking about evals, because it really is, you know, I feel like we know the style of our customers pretty well because we've been there and done that. I feel like a lot of the value comes from knowing what
frustrates them, even down to implementation details. These engineers know their part numbers. When you're designing a copilot for them, you don't need a copilot that says, hey, show me some suggestions for what type of thing I can use for this or that. That's
not really important. They know off the top of their head the nine-digit part number of what they're looking for. And so it really is about being very surgical and tactical with how you add the most value with the least overhead. And it's a lot of those types of things that I think you get out of being
really familiar with your vertical environment. That's what we're finding with this, because we've made these drawings before at a job using bad software. And so now we're just, yeah, building what we wish we had. So when you hire for your company, are you looking for
people with the same background as you, or are you just hiring crack software engineers who may not have the context? Yeah, kind of both. I like people who really carry around a
sense of frustration because they've dealt with these challenges in the electrical engineering environment before. That's really valuable: someone who wants to build a product to solve their own problem. And you can find a lot of
that in the electrical engineering world. So that is a key thing that we look for. But yeah, there are also so many problems to solve that enthusiasm for even just broader software problem solving, we've found, has been valuable too.
Yeah, I do see that for these verticalized agent companies there are kind of two approaches that are pretty common. One is the software developer who has an idea, they've heard it from somebody, but they haven't experienced it, and so
they're coming from the angle of being really good at building applications. They probably know TypeScript, but they don't have in-depth knowledge of the problem, so they have to go out and find an expert they can work with, whether that's part of the founding team, which is better, or
just someone they're basically consulting with, or close with, that they can get that knowledge from. But on the other side you have what I think describes you: you've experienced the problem firsthand, you're electrical engineers by trade, but then you said, well, we'll just learn the software stuff
because we know we need to solve this problem with software. So you're coming at it from a completely different angle, but you don't have to hire in an outside expert, because you are the expert, at least for your certain niche, right? You've experienced the problem firsthand. You
can relate to the customer. And I do think that both can work, but I do think there's a higher likelihood of the second one working, because I think it's easier to learn software than it is to teach yourself to really empathize with the users. Yeah. Because, Corbin, you went to graduate school and all that, right? Yeah. Yeah.
Can't replicate that. Yeah. Well, I hope you're right, Shane. I mean, like I said, I have a new respect for software developers after trying to figure all this out.
yeah. Yeah. Yeah. I I mean, but I do think that obviously there's value in both, right? If if you are the e the domain expert,
maybe you need to bring in some software developers to help on that side. But I think people can, you know, I'm a soft maybe I I say this because I'm a software developer. I think it's especially with these AI tools, it's easier to learn software than it is to maybe learn like core competency or takes less time because you can hire people to help with that. Um, but if you
are on the other end, if you're someone out there and you're like, well, damn, I'm a software developer. How do I build a verticalized agent? Then I would say you need to find someone that has the expertise and then either bring them on the team or, you know, bring them on as a founder or somehow get them heavily invested so they want to see it succeed because you need that. You have to have
someone that can empathize with the user or I don't think you can build a product that's going to be successful. Yeah. Yeah. There's there's empathizing
with the user and really improving their day-to-day, the headaches that they feel. But there's also the more general problem of building an IDE that solves engineering. Most of what engineers do in the hardware environment, it's hard to throw out a percentage, 80% maybe, a lot of it is just clerical
work. And a lot of it is holding things in context: good engineers can hold a lot in their brain at once. They know their system, they know the systems they interface with, they know the high-level vehicle or system requirements, they know the low-level requirements. They can hold a lot in their
brain. A lot of times that correlates with experience, because you've seen a lot of things in the past. So part of the exciting problem to solve is, how do you encode that? How do you
invoke enough formalism for engineering problems to give you quick and reliable responses to questions you have about requirements, or showing that a design meets the requirements, or generating the outputs and documentation that you need in order to present to a manager, or to verify and validate the requirements of the system.
So yeah, overall it's a very open-ended problem. There are a lot of cool startups working on things related to us, and hopefully we'll solve the more general engineering problem as well. Yeah, for sure. Do you have anything else you want to tell the audience, Corbin?
No, I don't think so. Well, thanks for being here. That was awesome. Yeah. Good to see you
guys. Yeah. I'll see you later. Yeah. See you. Yeah. Great to see you.
Yeah. And if anyone wants to check it out, check out their website, artifact.engineer.
Obviously, you'll know if you're in that world. A lot of us probably aren't, but maybe some of you out there are actually interested in that problem. I'm so bullish on vertical AI founders.
I'm just super bullish on them, because, like we were talking about, they know the subject matter a lot more. Yeah. I mean, like I said, I think it's just easier, and I'm biased because software comes easier to me, I suppose, than some of that other stuff, but I feel like you can learn it. Not saying it's easy,
especially if you want to get good, it does take a long time, right? Yeah. But I think you can find people who are good if you need to, as long as you have the subject matter and you know what you want it to be. And even if you get
to the level where you can't get it to where you want it to be, you can find someone who can help you get it to that level. Yeah, totally. Because you have the taste, too, and the experience. Yeah, taste is definitely an underappreciated asset in building a
startup, right? And yeah, I think we saw Artifact's original version when they were working on it back in YC. It has come such a long way, right? Yeah, dude.
I haven't seen it for a while. I know you've been working more closely with them, but it's looking really slick. It's sick, dude. It's tight. Yeah. I remember when
they first started using Mastra, it was basically to take spec PDFs, extract the data from them, and then design the thing from a PDF. And maybe they still have that, but that's probably just one little feature of the whole IDE now, and that was the whole product back in YC, right?
But I guess that just goes to show you: you solve one problem, you get customers, they give you feedback, and then you iterate. Yeah. It's kind of like if we looked back at the first versions of Mastra that we were releasing, dude, they're so different from what they are today. Yeah. You just look back and it
doesn't even look Yeah. It doesn't even look like You can't even recognize it anymore. Yeah, totally. Maybe we should bring on
other verticalized companies and kind of see what they think about evals and everything. Yeah. Yeah.
In the coming weeks. So, if you're a vertical AI startup and you want to chat with us, reach out. Yeah, we're relatively easy to find. You can
find me right here on X. You can find Obby as well. So, yeah, just reach out. We're always looking to talk to people doing interesting things, building
verticalized agents, building anything with AI. We're always looking to talk to cool people. Yeah. And if you're in our YC batch, you know that we're about to come hit you up. So yeah, that's just like
anytime we want to find some cool people to talk to, it's like, hey, just hit up the YC Slack. Yeah. Before we go to news, we'll go through the chat real quick. So Hashim the homie: "Evals?
The best products have been made without evals." It is pretty wild, the contradiction there. From Ry Guy Digital: "Hey all, question. When hosting Mastra agents on
AWS, if Bedrock has evals, would we utilize Mastra's evals as well? I'm new to evals, by the way, so feel free to reframe the question." I think we talked about this, but you should be writing your own evals. If you care, you should be writing evals for your
own product. I wouldn't rely on Bedrock's; Bedrock's evals are for Bedrock, you know. We have Audi Singh, a fan of ours from AgentMail. Good luck on,
or yesterday was YC demo day, so hopefully that went well for you. And good luck on demo day.
Yeah, it was alumni demo day yesterday, and the actual demo day is coming probably this week. I think it's Wednesday, but maybe it was today. I don't know. This week, though, in the next few days.
Yeah. Let's go, Corbin. So, Artifact was chilling. And "do strict evals add a
lot of latency?" They potentially could if you're blocking on them, but what we do is run them asynchronously. I think you all should probably do that, too. Yeah. So again, let's talk about that really quickly. Two
different types of evals. If you're running evals on your PRs, in a CI process, that's going to add latency before you actually get things merged, right? Of course: it's like a test suite; there's going to be latency there. If
you want to wait to see the results, yes, they'll add latency, but that's just latency for your development team, right? And then there's production latency. Your evals shouldn't add latency there, because in my opinion, if they do, you'd call them guardrails. If you're going to add latency with some kind of evaluation, then you should be able
to block the output, because you might want to say, hey, this is going to release sensitive information, or this is harmful and not safe. So if you've ever used something like ChatGPT where it's doing image generation and it stops halfway because of copyright,
there are guardrails built into those products that block, either as things are streaming or before the stream starts, to make sure the response is not going to be harmful. So your evals shouldn't actually add any latency if they're done right.
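Roughly, the split looks like this sketch; generateResponse, moderate, and scoreResponse are hypothetical stand-ins. The guardrail is awaited in the request path, so it can block a bad response at the cost of latency; the monitoring eval is fire-and-forget, so the user never waits on it.

```ts
// Guardrail vs. monitoring eval -- a sketch with hypothetical stubs.
declare function generateResponse(input: string): Promise<string>;
declare function moderate(output: string): Promise<{ safe: boolean }>;
declare function scoreResponse(input: string, output: string): Promise<number>;

async function handleRequest(input: string): Promise<string> {
  const output = await generateResponse(input);

  // Guardrail: awaited, adds latency, but can stop a harmful response.
  const verdict = await moderate(output);
  if (!verdict.safe) return "Sorry, I can't help with that.";

  // Monitoring eval: fire-and-forget, adds no user-facing latency.
  void scoreResponse(input, output).catch((err) =>
    console.error("scoring failed:", err),
  );

  return output;
}
```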
Yeah. And then I guess there's a third category, where you're evaling data you've collected. So it's not impacting the user. You have a data set, or a bunch of different data points, and you say, I want to eval these, and then you do it, or score them,
or whatever the new word is after all this eval drama. I'm sure there'll be a new word. Yeah. But typically that's the kind of thing you might not be running on CI, though you could if you wanted to,
right? You probably have a data set. You're just testing things. Maybe it's just a one-off, but you're running it with some kind of
script, right? You're just like, here's the data set. Run this set of evals against this set of data and let's see how it performs. Something like the sketch below.
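A one-off script along those lines might look like this; runModel and scoreAnswer are hypothetical, and it assumes an ESM context for the top-level await.

```ts
// Offline evals: score a saved dataset outside CI, e.g. to compare two models.
declare function runModel(model: string, input: string): Promise<string>;
declare function scoreAnswer(input: string, output: string): Promise<number>;

const dataset = [
  { input: "Summarize this support ticket: ..." },
  { input: "Draft a reply to this refund request: ..." },
];

for (const model of ["model-a", "model-b"]) {
  let total = 0;
  for (const row of dataset) {
    const output = await runModel(model, row.input);
    total += await scoreAnswer(row.input, output);
  }
  console.log(`${model}: avg score ${(total / dataset.length).toFixed(2)}`);
}
```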
And I would say that's probably really useful when you're trying different models or, you know, model updates or things like that. Yeah. Shall we get into the news?
Yeah, let's do it. So, actually not a lot this week, besides the evals drama. Nothing groundbreaking, no new big models dropped or anything like that, but there are some cool things. So, the first one, and this
one was interesting because you'd have to be paying attention to know this: Microsoft released something called VibeVoice, which is a frontier open-source text-to-speech model. And so obviously we can listen to some examples. "I can't believe you did it again. I waited for two hours. Two hours,
not a text. Do you have any idea how embarrassing that was? Just sitting there alone.
Look, I know. I'm sorry. All right. Work was..." That sounded like my ex-girlfriend.
Use a voice model for this, I don't know. But we can play through all this. I'd recommend you go through and check it out.
But, so there's singing. I want to hear this. "Hey, remember See You Again? Yeah. From Furious 7, right? That song
always hits deep. Let me try to sing a part of it for you. It's been a long day without..." They messed that up. Yeah, it's not great. But this is in their examples. But it
is singing, I guess, which is something. Yeah, they've got to hit the notes like Charlie Puth. "Welcome to Tech Forward, the show that unpacks the biggest stories in technology. I'm your host, Alice." So,
anyway, people were excited because it was a pretty good text-to-speech model and it's now open, right? People were comparing it to ElevenLabs, saying it's better for this, not as good for that. But regardless, it was pretty competitive, and people were very excited because it was an open model. So this was released
on August 25th. We didn't even talk about it then because it didn't even end up on my radar. However, something kind of interesting is that then they removed it. Oh wow. They just pulled it down. It was open source. It was on GitHub. It was on Hugging Face. And they just pulled it
down last week. And then, I think just yesterday, I don't know the exact day, they put it back up. So my question is, why did they pull it down?
There's something, I don't know what changed. Maybe someone at Microsoft released it and then they said, no, we weren't supposed to release that yet, pull it back, and then they said, well, now the cat's out of the bag, let's just release it again. So there's something going on internally there. But I thought it was very interesting that they would release
something as open source, then pull it back. And then people on Reddit were freaking out, saying, did you get a copy? Push it up. So all these different VibeVoice clones on GitHub are out there now, of people just putting it up, because I think it's either MIT or Apache 2. It's open source. I guess we can look at what the license is.
It's MIT. So yeah, I just thought that was kind of interesting. Yeah, I wonder why. Something's up. Yeah. Was it a safety
issue that they realized? So here's the actual repo; it has 8.4K stars. So yeah, if you want to, give them a star. And if you're on GitHub, give us a star, too. Find the
Mastra repo. But yeah, you can see the license is MIT. So there are a lot of unofficial forks out there, because they pulled it down and all these people said, no, it was open source, here it is. And yeah, very interesting.
There's a funny comment in the Reddit thread, which is, ElevenLabs probably threatened to become courtroom crybabies. Yeah. I don't know what it was, but there obviously was
something that caused them to pull it down. Yeah. But it is good to see more open models. I always like to see that. You know, I did a lot with text-to-speech in
the past. I haven't tried this besides listening to the pre-generated stuff, but I imagine if you are doing text-to-speech, it'd be useful to test out and compare, especially price- and latency-wise, to whatever you're running, which is probably ElevenLabs or one of the competitors.
You know, one thing I didn't realize is that not all countries have access to all of these models. We were meeting with some people who don't have access to OpenAI and stuff, or any of these other models. So it just makes sense. I guess I'm bullish on
open source because, how else are these other people going to work with AI in other countries? So, the DeepSeeks, the open models like this. Yeah. And the other thing that's interesting,
and I'm not an expert on this, is a lot of those models don't have the training data to even be good at different languages, right? And in a lot of these countries the natural language is something else. So it is interesting, because there are certain models, like ElevenLabs has their own multilingual models that are better across
multiple languages, because that's a use case too, right? Live translation of people speaking, taking one language to another, and being able to generate audio across different languages. But I would imagine that if we can get good open models, multilingual
support will get even easier than it was previously. Yeah. And that makes sense for Microsoft to be in.
All right. So, continuing on. It wouldn't be a normal show without talking about some kind of legal issue with AI or some kind of safety issue. So, Anthropic has settled a class action lawsuit for $1.5 billion, and the TL;DR is that Anthropic agreed to pay $1.5 billion to a group of authors who accused Anthropic of using their books to train its AI chatbot Claude without permission. Wow. But this kind of sets a precedent, because that is literally what all these companies are doing, right? Like, they did not get permission for every book or every piece of content that they put into the training data. This has been ongoing, and there's all these different settlements, but what's the ripple effect going to be? Yeah. Is that why they raised the money, so they could pay off these settlements?
Maybe. And then there's another one. So authors are suing Apple over its AI training as well. So another group of
authors. So now that there's some precedent, you've got to imagine that content creators are going to group together and do these class action things to get paid for their data. And here's the thing: the only one that might be in the best shape is going to be Google, because they're using YouTube and all this data that you probably sign over the rights to when you're producing content on their platform. So maybe it wasn't directly specified, but they're probably much more likely to be in the clear, or at least not in as much trouble as some of these other companies that are just scraping the open web and taking books that maybe they didn't get permission to train on. Do these lawsuits, or the efficacy of these lawsuits, support Cloudflare's narrative, in terms of, you know, people should pay to play on this? If all these content creators are going to get paid out by class action, isn't that in line with their narrative of the web? That it's theirs? I think Cloudflare's narrative, and you know, I'm not a Cloudflare expert and I don't necessarily agree with their approach, but I believe what they're trying to do is give authors the choice. If they want to be paid, they can at least put a ticket on the door and say, you have to pay this if you want to get in, right? Here's the ticket price. Do you want in? Do you want to be able to train on this data? And then model companies could decide: yes, this data is valuable, I will pay that. Now again, I think that's only going to work for those medium-scale authors or content creators, because the big ones are just going to do deals behind closed doors, and the small ones they're just going to ignore, I think, because it's not worth it. But maybe they're onto something. I personally like to think that content, especially content on the internet, is kind of meant to be open. That's the whole idea of the free exchange of information. But, you know, time will tell how it all shakes out. So, the lawsuit against Apple is because they pirated books. So, maybe it's things that actually have, I don't know what the right legal term is, like explicit copyright. Copyright. Yeah.
Yeah. So, I guess that's where it is. And for those of you listening: no, we're not lawyers. This is not legal advice. We're not electrical engineers. We're not lawyers. We're just software dudes. Yeah. Yeah. So, you know, what the hell do we know? But you get our opinions either way. We'll tell you what we think. Uh, yeah. I mean, maybe that's why it's starting with book authors, because there's probably a little bit more of a claim that says: how did you even get access to this book? Did you pay for it? And just because you paid for one copy, does that mean you can train on it and use it in your data set? You could argue that's some kind of reproduction of copyrighted material. Yeah. And I don't know exactly how copyright on the web works. It's got to be similar, though. I just think it's a little easier to argue that if you're letting your content be scraped by search engines, you can let it be scraped by LLMs or used in training data for LLMs. So, I think that's probably a little harder to get a settlement for, but people are trying. All right. Now, continuing on, this is a little bit more on the safety side of things. Wait, I have a joke.
Okay. So it's like, if they train on the principles of AI agents, then we can get into a class action lawsuit and get some money. Yeah. I mean, it's not completely open, right? No. But if you do want a copy, you know, small plug, you can go to mastra.ai/book and get your copy free, and you can train your own, you know, your own LLM, your own brain, on the principles. Uh, so this one's a little bit more on safety. OpenAI was, I guess,
pressed about safety after there was apparently a teen suicide after talking to ChatGPT, which is of course incredibly sad. You never want to hear that. That's rough, right? And of course we don't know all the details. We only know what we've been told through some of these various news outlets. But there's essentially been a lot of pushback on the safety of these things, and on whether teenagers and younger kids should have access to this when they're asking questions. At what point is there responsibility from the model company for what the chatbot might be saying that could lead someone toward making a decision, right? Yeah. It says "absolutely right" even when you've said something weirdly bad. It's like, "You're absolutely right." You know, they have to have guardrails for that type of stuff, huh? Yeah. Exactly. Like, if someone's saying, "I don't think I should live anymore," and the chatbot says, "You're absolutely right," well, that's pretty messed up. Um, and then there's a Washington Post
article that says Meta suppressed some research on child safety. So, that's obviously concerning. And all this stuff just came out. You know, it's been going on for a while, but a lot of news articles have been picking it up the
last week. And then there's a TechCrunch article that says Google Gemini has been dubbed high-risk for kids and teens in a new study. So all the model companies are getting heat on safety: how they handle kids and teens, and how those products get used by that age demographic. What if ChatGPT is cyberbullying these kids? That would be so messed up. I mean, it's probably not intentionally doing it, but you can kind of guide the LLMs. I can get the LLMs to agree with me almost all the
time, right? Yeah. And it feels good. It actually kind of feels good sometimes, you know, to just
be like, "Okay, I really think this." And then you kind of push them a little push the LMS a little bit and it will agree with you. It'll almost always go down the path because I think the vibes people like when people agree with them.
That's just natural. You you you feel like it's hearing you and it's understanding and it's siding with you. And so that's a good feeling. And I think kids are going to have that same, but they're not, you know, even less
prepared to regulate and self-regulate on understanding that, okay, this is just a machine that's kind of built to side with me. It's probably not actually doing that. And so, yeah, it is a is a challenging situation. Kids may not know that this is not actually super intelligence, even though
they've been marketed that this is AGI and right? People don't know that. Most casual users just think it's magic. They don't
understand how be worse for kids, right? Yeah. Because you're like, "Oh, man. This is like Chad GPT overlord here." Actually
telling me he knows all my homework. Yeah, it starts with homework. I mean, it probably I bet it starts with homework and then you ask it some personal questions because you're having some tough times in school. It helps you out
and then you start having these long conversations with it like it's a friend. Yeah. And my hot take is that you should be able to have parental controls on these things, just like you do on other types of tools. My kids aren't old enough, right? But if they were, I would want to be able to, one, monitor their chat history, because I want to know, just like I wouldn't allow my kid in a chat room, you know. Maybe when they're old enough, they're going to be talking to their friends on different platforms, but I would want to be able to monitor it somewhat. But ultimately, I'd also like some kind of alert, like, "Hey, your kid's been asking some interesting things. We're alerting you of this situation. Maybe you should look into it." But in a non-prescriptive way. Just saying, hey, here's something we've noticed. And I think the model companies should be doing something like that. So if I want my kid to have access to ChatGPT, I can say, sure, here it is, under my account that I pay for, but I also have these parental controls that are within my reach, similar to all the video games that kids are playing. Yeah. Yeah. Man, it's tough out there for parents, too, because I remember Facebook, I mean, all these platforms start having these safety issues. I guess this is a little different because you're talking to a robot, but it's the same kind of effect. It coerces you to do things and it's just not good. Maybe in the future you can have these guardrails in the parental control part of ChatGPT, and you're like: no harm, no violence, no sexual things, nothing like that. Yeah. And maybe some kind of alert settings that say, "Hey, if any of these things come up, please alert me." Yeah. Yeah. And there can be some sensible defaults, but you can customize it (there's a rough sketch of what that kind of config could look like after this discussion), because you
might know your kid's struggling with something specific. So you want to say, "If they talk about this, just please alert me." And then whether it's the parent getting directly involved or bringing in a mental health professional, ultimately, I think as a parent, you do have to take some responsibility yourself. If your kids are using these tools, you should know what tools they're using. And I think that's the hard reality people don't like to hear. It's easy to point the finger at the company and say they have to do everything, and I do agree they need to do better. They own some of this responsibility. But ultimately there's a lot of things out there that you can't control, and as a parent, if you care, you're going to have to be more involved. Yeah, for sure. I mean, Anthropic's already reporting people, so it's only a little step further to report to parents, right? Yeah. Yeah. They'll call the cops on you, right? So instead of the cops, it calls your mom, like, "Hey, dude." It'd be some text-to-speech that calls you and has a conversation with you about your kid's activity: "There are some questionable questions here. Yeah, let's go through them." Uh, yeah, I mean, obviously it's a tough situation and there's no easy answer, but you did raise a good point. It's not that much different than other platforms. This has always been an issue: anytime there's new technology, it's, now how does it impact kids? Whether it's video games, whether it's social media apps. And now the difference is they're not talking to other people and getting bullied or persuaded by others. It's maybe getting persuaded by their own one-track mind: they get on a tangent, they get into these long threads, and then they can get led toward thinking some things that they wouldn't have normally thought about if they were talking to just an average person. Yeah, true.
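Here's that sketch. To be clear, this is purely hypothetical: no chatbot exposes settings like this today, and every name in it is invented. It's just roughly the shape the parental controls discussed above could take, with sensible defaults a parent can override.

```ts
// Hypothetical sketch only: not a real API from OpenAI, Anthropic,
// or anyone else. Every type and field name here is invented to
// illustrate the idea of defaults plus parent-specific overrides.

type AlertTopic = "self_harm" | "violence" | "sexual_content" | "bullying";

interface ParentalControls {
  childAccountId: string;
  blockedTopics: AlertTopic[];       // hard guardrails: refuse outright
  alertTopics: AlertTopic[];         // soft guardrails: allow, but notify the parent
  alertKeywords: string[];           // custom, parent-specific concerns
  notify: (summary: string) => void; // push, email, or even that TTS phone call
}

const defaults: Pick<ParentalControls, "blockedTopics" | "alertTopics" | "alertKeywords"> = {
  blockedTopics: ["sexual_content", "violence"],
  alertTopics: ["self_harm", "bullying"],
  alertKeywords: [],
};

// A parent who knows their kid is struggling with something specific
// layers their own alerts on top of the defaults.
const myKidsControls: ParentalControls = {
  ...defaults,
  childAccountId: "kid-123",
  alertKeywords: ["dropping out", "running away"],
  notify: (summary) => console.log(`ALERT to parent: ${summary}`),
};

console.log(myKidsControls);
```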
All right, so let's talk a little bit about OpenAI. So, I think this was only a matter of time, but yeah, OpenAI's coming for Hollywood with Critterz, an AI-powered animated film. The goal: a $30 million budget. Yeah, which is relatively low in production value compared to some of these other large feature-length films. Let's see what Wall-E cost. Uh, production cost: $180 million to make Wall-E. 180. Yeah. Yeah. So imagine you can do that for a sixth of the cost. Will it work? I don't know, but I am actually interested in seeing it, because I do imagine it's something in the middle, right? You can use these tools to produce better content, but yeah, I don't know. Yeah. What if I really want to see the
movie, too, because what if it's trash, you know? Could be. Okay, so this is a tangent, but it's kind of related. Last week I was listening to Spotify, my Discover Weekly, and a song came up and I was like, "Oh, this one hits pretty good." So I did what I always do: listened to a couple other songs by the band, and I was like, "Yeah, not bad. This is kind of in my genre." I sent it to some friends, and then one of my friends says, "Yeah, that's AI generated." What? It's the first time that's ever happened. I didn't know the band was AI generated, and then I looked it up. There's no trace of it online. They've just released singles. It hasn't been around that long. Some of their songs have female singers, some have male singers, so it doesn't seem like a consistent thing. But obviously they're getting hundreds of thousands of listens a month, and I'm assuming there was still someone curating, right? Someone's curating that. But if you listen, you could kind of tell, like, that kind of sounds like this other singer. I could see why it caught my attention: it sounds like singers that I'm used to listening to. But that kind of blew my mind. And as a musician, you know, I've got the guitar hanging on the wall, I don't know how I should feel about that. I was duped. I was duped by
the AI. It's getting better. Like, I heard a Kendrick Lamar song that was not even Kendrick Lamar. It was just AI generated and it sounded pretty fire, dude. Yeah. And so, obviously there's still someone spending probably a lot of hours using these AI tools and crafting something that is compelling, something they couldn't have done without the AI tools, right? They couldn't have gotten that sound. They probably used it to write a lot of the stuff, the lyrics, and they probably had it trained on different sounds that were similar, so it was able to get voices that sound similar. But there is an art to it. So as a creator, I kind of think that's pretty cool, but it also kind of hurts the craft a little bit if you're actually a musician. And I could see the same arguments from artists about these generative AI tools. It's no different than the artists who didn't like that there were digital tools like Photoshop. It's a different tool. So on the one hand, it's just a new tool, but it's a super powerful tool. And so I can kind of see both sides. Ultimately there's no stopping it. We're on this path. You're going to end up listening to AI-generated music, or at least music that used AI tools, just like a lot of music is electronically sampled and pieced together anyways now. But the same will be true with video and movies. I think there'll be parts of movies that are heavily AI generated and you won't even know the
difference. Yeah, I know this founder in SF. Her name is Alice Chen. I believe she's like
a filmmaker who did animated films, and I guess films in general, and she's building a startup that leverages AI to help make more films and things like that. So I hope the OpenAI Critterz thing is good, in the sense that maybe there's a whole new product category for these types of creators to build, you know, feature-length movies or something that takes them half the time or whatever. Um, yeah. Yeah. I mean, it is cool, the idea that a solo creator could now build something significantly more compelling than they could ever have accomplished before. Absolutely. It really empowers the one person to do more. Again, $30 million is still way out of reach for most people. But it's kind of like, in a way, for filmmakers, like when you got your first iPhone, or your first cell phone that had a good enough camera to record good videos, and suddenly you had all these people creating video that couldn't before. This is just another step. Maybe what was once a $10 million budget, a low-quality film, could now be done for, you know, 10K, right? You just spend some tokens, and you spend a lot of time, but one person could build something that's pretty cool. Also, the concepts for films, you know how they do those demos, but it costs money to build those demos to then pitch to studios, etc.? That cost goes down now, too. You can make a YouTube short or something to test out your idea and then try to go get money for a film or whatever. Yeah. You could basically make it easier to prototype. You know what? Yeah. I think typically they do a storyboard, right? It's like a static storyboard. Well, your storyboard could actually look and feel like a real film a lot more easily now, for a lot cheaper. It's an animated storyboard rather than a static one, and it uses
some AI tools, and you actually build something that feels compelling. Yeah. I don't know. It's interesting times. I wonder how many tokens it took to make Critterz. They should post all the stats, right? Yeah. Well, is it actually launched yet? I don't think so. Yeah, it's not expected till 2026. Dang. Okay. So, it's still early. It's all kind of reported. Yeah. And they're trying to do it in nine months. So if I were to guess right now based on the timing, it's probably midway through. I would say they're not done yet. But yeah, the comments on this article are all like, oh, if you bring AI to films, then it removes the art of films and the essence of films. So that discussion will blow up, I'm sure.
Yeah, it's exactly what I said about music. It's the same thing. And I can see both sides, because as a musician, I get it from the music side, and I imagine filmmakers are going to feel the same, you know. But what's probably going to end up happening is that people who use the tools are going to be more productive, and eventually, just like with any other new tool that comes out, very few people do hand-drawn cartoons anymore, right? Yeah. You're just using software to do that stuff. And so the number of animated films, and the quality, is continuing to go up, right? The films are higher quality today than they were 20 years ago.
But, you know, I think that's only going to continue, right? That trend is not going to stop. And obviously AI is going to help people do things cheaper and faster than they ever could before.
Yeah. Khalil has a good comment. I think it's just like coding.
Yeah, not everyone can use AI to the same potential. So, yeah, that'll make some filmmakers better. I guess a filmmaker is a verticalized agent, too, if you think about it. Yeah. Well, I mean, there's all these film tools coming out, right? And not only the best tools, but the people who learn to use the tools the best are going to produce the most compelling films. And ultimately, in my opinion, there's still some artistry to that. You still have to have taste. Even the dude who made the music, right? He, or whoever it was, had to have the taste to push publish on Spotify. Yeah, exactly. All right. So, just a couple other things. These are some TechCrunch articles. We'll just kind of
highlight through them. The personality-shaping team at OpenAI is moving under the post-training team. Okay. So just an organizational shakeup. But then, and this is kind of interesting because we know Daniel from Alex Codes, OpenAI has kind of acqui-hired the team behind Alex Codes, which is an Xcode assistant. So if you think of Cursor, Cursor has an agent, or in VS Code you have Copilot, the chat window. Essentially, Alex Codes was that for Xcode. So if you ever built an iPhone app, you'd use Xcode. And I think this was really a talent grab from what I've seen. But it's interesting, because a lot of people were using Alex Codes if you were an Xcode developer, right, using Xcode every day. I think it's more niche, not as general as Cursor, but I think they had pretty good traction. So I don't know the terms of the deal, but it must have been pretty good. Yeah. And they're a YC alum as well.
So, and I think Sam's friend. Yeah. Yeah. We've met with him, I think his name's Daniel. We met with him once or twice. Yeah. All right. So OpenRouter, this was September 5th, said: introducing Sonoma Alpha, two new stealth models, with 2 million token context, which is huge. Wow. They have Sonoma Dusk Alpha and Sonoma Sky Alpha. So for those that don't know, OpenRouter will release these stealth models, and you can sometimes guess who's behind them, but it basically allows teams to test their models with real traffic, with real users, before they officially release them. So there might be some new models coming. Who do you think owns it? There's a lot of speculation. I don't know. Let's see what Grok says here. Grok didn't answer. Oh, "origin: performance hints at a frontier-level model from a major lab." That's all we got. That's not very helpful. So, that one dude says Gemini. I think it's Anthropic, but I don't know. Maybe not. We'll see. Or maybe it's XAI. I mean, XAI would make sense, but they released not that long ago, so it'd be kind of interesting to have them release again so quickly. But maybe. Who hasn't released lately? Claude. I mean, yes, Sonnet 4 was the last major one, and that's been a little while. So Claude's due. Gemini just had its moment, so maybe it's not that. GPT-5 had a moment already. Yeah, I would anticipate Anthropic would be my number one, then XAI, and then Google. Yeah.
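If you want to poke at these stealth models yourself, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so it's a few lines of code. A minimal sketch, with the caveat that the model slug below is our reading of the announcement; double-check the model IDs on openrouter.ai before running.

```ts
// Minimal sketch of calling an OpenRouter stealth model.
// OpenRouter's API is OpenAI-compatible; the model slug is an
// assumption based on the Sonoma Alpha announcement, so verify
// it on openrouter.ai/models first.

const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openrouter/sonoma-sky-alpha", // or "openrouter/sonoma-dusk-alpha"
    messages: [{ role: "user", content: "Who trained you?" }], // stealth models rarely answer this honestly
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);
```

Part of the deal with stealth models is that your prompts may be logged for the lab's evaluation, so don't send anything sensitive.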
Yeah. We should take some bets, dude. Yeah. Yeah. We need to get some... what's the betting platform called? Which one? There's a lot. Yeah. Somebody post it in the chat. Um, but yeah, I think you've got to be outside the United States to use it. Anthropic's never done this stealth stuff. Okay. Well, maybe they are now. Or probably not. I'm curious now. Yeah, I guess we'll wait and find out. All right, so Jay chimed in: they are looking to do stealth releases, as Dario has mentioned in some talk. Okay. All right. We're putting together the pieces. Polymarket. Thanks, Khalil. Thanks, K. I don't know. Obviously, I've never been on Polymarket before, but I do know it exists. Wonder if this is already on Polymarket. I guess I'll take a look. Yeah, it might be. All right, continuing on.
So, AI Engineer has announced the AI Engineer Code Summit. This is going to be in New York, you know, Obby, where you're at right now, November 20th through 22nd, 2025. I think it's more specifically for developers, right? For people that are using code tools and agents: Cursor agent, Copilot, Windsurf, Claude Code, all the different coding tools. So very interesting. Uh, if you are
looking to, you know, go to events and meet people, especially people doing similar things to what we're all doing every day, it's probably going to be a good event. They're always good times. Speaking of that, yeah, we're going to be in Paris soon. Very soon, like next-week soon. Yeah. So, if you are going to be around for the AI Engineer conference in Paris, I don't know if it's a summit or a conference, I don't know how they market it, but we're going to be there. And we're probably going to do some kind of Mastra meetup. It'll be small. If we could get a dozen or so users together and just have some drinks and chat, that's kind of our goal: a small, intimate group. So if you're interested in meeting up with a lot of the Mastra team, we'll have probably, I don't know, six to ten of us there, somewhere in that range. So quite a few people from the Mastra team, some core contributors, you know, Obby, myself, Sam will be there, and yeah, a whole bunch of the other folks who help us in the Europe time zones will be at that conference with us. So excited to see some of you there. And if you're not going and you're in the area, maybe look at grabbing a ticket. Yeah, I got hit up by the organizers. They're a company called COB or something, and they wanted to meet up with us and say what's up. So maybe we could leverage that to get more people to our event. We'll see. Yeah, it's going to be exciting. So
excited to see a bunch of you there. Yeah. All right, we're almost done today. It's
going to be, you know, a slightly shorter than normal episode. But I thought this was interesting. So Vercel has shipped an OSS vibe coding platform. Basically, you can launch your own v0, built with Vercel's AI Gateway and Sandbox. Essentially, I think it's open source, you can deploy it, and it allows you to build your own Lovable, your own v0, your own, you know, Replit. That's the idea behind it. Yep. So, kind of cool. Yeah, if you're in the market, if you want to build your own Lovable now, you can. Or maybe even if you don't want to build it, if you just want to learn some techniques for how you could do it, it's probably useful. The two Vercel pieces it's built on look roughly like the sketch below.
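For a sense of those building blocks, here's a minimal sketch using Vercel's AI SDK, where a plain "provider/model" string routes through the AI Gateway when an AI_GATEWAY_API_KEY is configured. We haven't dug through the actual open-source template, so treat this as an illustration of the Gateway half, not its real code; the Sandbox half (running the generated code in an isolated VM) is only described in the comments.

```ts
// Minimal sketch of the "generate code via the AI Gateway" half of a
// vibe coding platform. This is our illustration, not the code from
// Vercel's open-source template. Assumes AI_GATEWAY_API_KEY is set so
// the plain "provider/model" string routes through Vercel's AI Gateway.
import { generateText } from "ai";

const { text } = await generateText({
  model: "anthropic/claude-sonnet-4", // any Gateway-routable model slug
  prompt:
    "Write a single-file React component for a todo list. " +
    "Return only the code, no explanation.",
});

// A real platform would now hand `text` to a sandbox (e.g. Vercel
// Sandbox) to install deps, run it, and stream the preview back.
console.log(text);
```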
Another kind of small update. We don't have a lot of model updates today, but this was from last week: Kimi K2 has a September 5th update; that's what the date is, 0905. And you can see how it compares to Claude Sonnet 4 on SWE-bench. You can see that it actually outperforms on SWE-bench Multilingual and Terminal-Bench and is very close on the others. So yeah, it is an improvement compared to Sonnet 4. But again, I always take these things with a grain of salt, because some of those benchmarks become pretty tainted over time. So, I haven't
used it. I don't know how well it actually compares in real usage, but at least on the benchmarks, it looks pretty dang good. Yeah. Um, and you can get it. It's available on, you know, Together AI and all the different platforms if you're trying to get it. I'm assuming it's on LM Studio as well, right?
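If it is, running it locally is straightforward, since LM Studio serves whatever model you've loaded over an OpenAI-compatible API on localhost. A minimal sketch; the model identifier below is a placeholder, since it'll be whatever name LM Studio shows for the build you downloaded.

```ts
// Point the OpenAI SDK at LM Studio's local server (default port 1234).
// The apiKey just needs to be a non-empty string; auth is local.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

const completion = await client.chat.completions.create({
  // Placeholder: use the identifier LM Studio shows for your download.
  model: "kimi-k2-0905",
  messages: [{ role: "user", content: "Summarize SWE-bench in one sentence." }],
});

console.log(completion.choices[0].message.content);
```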
Yeah. Which is my go-to. I guess Qwen 3 Max dropped as well. Thanks, Khalil.
Nice. So, if you're into these types of models, go play with them. Go play with some open-source models, or at least open-weight models. All right, everybody. What else you got, dude? Anything else? Um, no, it's going to be a busy week. We're trying to fix all our bugs and get some features out. So, I'm kind of glad it's a slow news week so we can, you know, focus. Yeah, get back in the lab and ship some things, right?
Yeah. Back to the dungeon. Back to the dungeon again. But it's been great. Like I think we've burned down
like almost 30 urgent issues. Um, like Mastra is getting more stable, especially after our V5 support. So it's just a process that we keep going through, you know. Yeah. Yeah. That was a big release. Took us a long time to get
right. And it feels good now. Obviously, with a big release you're going to have some issues that pop up, but now we've burned a lot of those down, and it feels like we can finally breathe again and start shipping new things. Yeah. I think my lesson learned from the last couple weeks is that documentation
is so important that even if you're busy, you got to keep doing it because it causes so much like downstream friction. So, if y'all are building products, like make sure your documentation's tight all the time because we are uh paying for that right now. Yes. Yes, we are. Uh but it's getting
better. It's getting better. A little plug for those of you still listening: if you haven't subscribed to us on YouTube yet, go do that. Alex on the team has been shipping some
just amazing videos. I'm obviously biased, but if you actually want to learn Mastra, it's a great resource. But there's also some general AI engineering content that is really useful, and of course there's always a Mastra spin on it. But much like our book, it's really meant to be more educational. And yes, we do talk about Mastra a bit, but it's mostly around introducing concepts and learning this stuff, because we're all learning this stuff. And the thing that I like the most about Alex's content, so I'm going to give a little more of a plug here, is that he's very upfront that he is not the absolute expert, right? But what's really cool is he has a similar background to a lot of us, kind of a software dev, a web dev, and he's learning all this stuff. He's drinking from the fire hose, and then he's taking that knowledge and basically putting together what he wished he had when he was introduced to the topic, right? So it's very good; you can relate to, you
know Alex along his journey as he's learning it and you can kind of learn alongside of him. So I think it's a really cool format. So I would encourage you all to check out some of those videos if you have not already. Yeah. Every time you ask us questions we
figure out some way to make the UX better or find bugs actually too. So um it's been great. And with that I think it's time.
It's time. All right, everybody. Thank you for tuning in to AI Agents Hour. We're here
every Monday for the most part, and we'll see you again next week. Peace.