
Is Claude a Narc? Stagehand, OpenAI Codex, and Operator using o3

May 27, 2025

Today we look at the Claude 4 system prompt, discuss whether Claude is a narc that will call the police on you, try out OpenAI Codex, and talk with Anirudh from Stagehand.

Guests in this episode

Anirudh Kamath


Browserbase

Episode Transcript

0:02

hello everyone and welcome to AI Agents Hour i'm Shane this is brought to you by Mastra and today's kind of a bonus episode day because if you were up very early in US time zones or if you were in the you know European time zones you had Abhi on earlier with Marvin from the Monster team they talked some AI news they were

0:24

uh doing some live coding but today we do have a very uh pretty jam-packed episode we're probably not going to go as long as we normally do because you already had the the previous episode earlier but we're going to be talking a little bit about what's happening in AI just some news just like we always do

0:42

then we're going to spend a little bit of time and I'm just going to try to get Codex to write some code for us and we're going to see how it works in case you have heard about it but you haven't seen it in action you haven't tried it we're going to try it together and I have no idea what to expect i have no idea how well it works i've heard some

1:00

good things i've heard some mixed reviews but we're going to put it to the test so we're going to be testing out OpenAI Codex for a while and then we have Anirudh from Stagehand, uh, Browserbase coming on and we're going to just chat with Anirudh and see you know what's going on in the Stagehand world

1:20

and with that uh make sure if you are not already please do uh subscribe to us on YouTube please give us a star on GitHub please follow us on X all the things if you want you can find me on X there I am let's get into some news so the first uh news is OpenAI has upgraded the model that's powering its Operator agent so I don't think we've talked about this yet but previously it was a custom version

1:56

of GPT-4o and now it sounds like it's using a model that's kind of based on o3 which is of course their reasoning model i'm curious if you're listening to this if you're in the chat this is live you can add comments whether you're on X YouTube or uh yeah or even on LinkedIn I think so add a comment are you using o3 yourself i will admit I use o3

2:23

pretty much every day one of the reasons I use o3 is to help me compile this news so this topic was actually generated by o3 which is funny o3 gave me a topic that's talking about itself maybe there should be a concern there but uh so it is interesting that now Operator which if you're not familiar with OpenAI's Operator kind of lets you uh it's almost like a way to control a

2:49

you know a browser instance essentially or you know similar to Claude's uh computer use but it does uh kind of allow you to do things you can tell it to go uh you know go to a website and click on a button for you and it should be able to do that but it is using now o3 under the hood which ideally because of o3's advanced reasoning capabilities seems like it

3:10

does give you better accuracy i have not tried Operator recently so I'm curious how others are getting on with it but if you you know have used it curious how it's going for you maybe when we get into Codex we can test Operator out as well and just see for those of you who haven't seen it in action we can actually just test it out so if you have

3:30

ideas of what we want to try to get Operator to do feel free to drop a comment and we will try out a thing or two before we start testing out Codex so that was the first news item the second news item and this one we've talked about quite a bit over the last few days but Claude 4 has dropped right so there's a new version of Claude 4 and there's a couple interesting

3:57

things i think some of the benchmarks I believe you know Abhi talked about it earlier some of the benchmarks aren't necessarily significantly better but others have reported pretty good things with specifically the coding ability of Claude 4 but one thing kind of stood out specifically to me and that is that you

4:17

know Anthropic has faced a little uh backlash on Claude 4's Opus model that it can potentially you know potentially contact authorities if it thinks you're doing something egregiously immoral so on the one hand okay that makes sense if someone's really asking to do something that could be considered bad they're you know kind

4:41

of having that security in there could be good but it's also a little scary that a model could be using your system against you to basically go out and contact authorities potentially if you're asking questions and who who gets to define what egregiously immoral is and does the does the model get to decide that do you know does it always

5:02

listen to exactly how the system prompt was set up so those are some uh you know maybe concerning things i'll maybe share my screen and we can kind of look at this specific article which happens to be from VentureBeat where you see uh you know kind of the original tweet that started

5:21

the controversy if it thinks you're doing something egregiously immoral for example like faking data in a pharmaceutical trial it will use command line tools to contact the press contact regulators try to lock you out of relevant systems or all of the above now that does concern me a little bit rightfully so I believe does it concern you chat i guess let me know

5:44

i think that you know I would be curious to know what it defines as egregiously immoral and where that line gets to move to over time because it does seem like now your model is able to basically use your system potentially if it needs to if it thinks you're doing something wrong so there's just some like

6:07

Terminator-type uh concerns that I have here you know most likely overblown but also you know it is a little concerning and the other thing that I thought was very interesting is you know uh Simon Willison does these uh kind of highlights from the system prompt that gets released from Claude and so there's kind of this Claude 4 highlights and I

6:35

thought we would maybe spend some time going through that i've read through parts of it but I did want to just kind of read it together and we can kind of talk through some of the things that are kind of interesting because there is a lot you can learn by reading the system prompts of uh these these different agents or different models and yeah sir

6:55

also believes that this uh it is concerning so I'm not the only one thanks good to know that there are others out there that are you know slightly concerned we'll say so yeah let's look at this uh Claude 4 system prompt analysis and we will do some analysis on Simon's analysis

7:22

perhaps okay so it kind of has a nice table of contents here um sections in bold are his editorial emphasis so let's just kind of go through so it does give the current date this is kind of the introduction so it looks like this is for Opus, Opus is the most powerful model um okay so it does try to

7:55

establish its personality so if it seems unhappy Claude responds normally and it tells them that they can press thumbs down on Claude's response to provide feedback it does not mention the user that it's responding to okay and I think Claude you know doesn't you know for those of you that are not aware it

8:23

doesn't really make use of memory yet as far as I've seen compared to you know like ChatGPT has this idea of memory so it can kind of remember things across threads which is sometimes really helpful but sometimes you don't want that and there are ways of course in ChatGPT you can just do like a temporary chat and it doesn't use those

8:40

memories but Claude doesn't necessarily have that set up outside of threads yet i do imagine it'll come but it is kind of interesting that it you know doesn't mention the user or anything uh it does let's look at read some of this information about model safety so it cares about child safety content you know which it should

9:04

so it shouldn't you know obviously that's that's a good thing does not provide information on chemical or biological or nuclear weapons that is good even if they have a good reason for asking for it steers away from malicious or harmful use tries to you know refuses to write code that's maybe used maliciously so but it does assume that the human is asking for something legal and legitimate if their message is

9:40

ambiguous and could have legal and legitimate interpretation and so regarding conversations for more casual emotional empathetic or advice conversation Claude keeps its tone natural warm and empathetic Claude responds in sentences or paragraphs and should not use lists in chitchat in casual conversations or in empathetic or advice-driven

10:08

conversations and it can be it says it's okay for it to be short uh so there's some more points on style to try to give concise responses and I think the reason that some of this is interesting to read is because if you're writing prompts for your agents it's really useful to see kind of what the the model prompts are doing because you can learn a lot around how to write

10:32

good system prompts uh you can learn tips and you can also learn how it thinks so if you're using in this case Claude in your agents you're going to have some ideas of how to best talk to it because you know what it's looking for you know how it's going to try to respond Claude engages with questions about its own consciousness experience emotions and so on as open questions and

11:00

doesn't definitively claim to have or not have personal experiences or opinions the person's message may contain a false statement or you know or presupposition and Claude should check this if uncertain if the user corrects Claude or tells Claude it's made a mistake then Claude first thinks through the issue carefully before acknowledging the user since users sometimes make

11:24

errors themselves um so be cognizant of red flags so it knows what red flags are without being explicitly told so Claude should be cognizant of red flags in person's message and avoid responding in ways that could be harmful if they have questionable intentions especially towards vulnerable groups Claude does not interpret them

11:50

charitably so it talks about the knowledge cutoff is at the end of January 2025 that's interesting okay and so you know maybe it's March maybe it's January but the system prompt seems to say you know only January maybe they just didn't have as much information i don't know that's interesting talks about election info specifically

12:19

election info that's interesting that election info is in the system prompt you can tell that I think a lot of people try to get these models to return information about the election or one candidate or another and they want to almost have like gotchas to say like this model is biased and here's why and

12:38

so I do think that is very interesting that they've actually called it out specifically in the system prompt i'm going to make this just a touch bigger um this one is so this is very well needed I've seen uh notes that some people have said you know like some versions of Claude specifically like 3.5

13:08

and maybe even 3.7 more so it would basically just try to be you know almost too nice in saying that everything's good or great or what a great thought that's a great idea and so it tries to cut that out so Claude doesn't say Claude never starts its response by saying a question or idea or observation

13:30

was good great fascinating profound excellent or any other positive adjective it skips the flattery and responds directly and then to close things out Claude is now being connected with a person okay so kind of then it goes through some things that were removed um missing prompts for tools so you can just see we're not going to go through all this we went through kind of some of the high-level stuff but there's some

14:03

stuff on if you get down here a ways and I know I'm scrolling kind of fast but like how they use thinking blocks how they use instructions and you can see the use of kind of XML tags in its prompt which is something that I do quite a bit honestly when I'm structuring prompts i will just casually add in XML tags it seems to improve it

14:28

for me you know it could be slightly subjective but I do think that having clear start and end of blocks is helpful and probably provides some context so I do that quite frequently if I'm writing a longer prompt but I'm curious if others do that as well ideally you shouldn't have to and as models get more uh advanced we shouldn't have to do that kind of thing
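
For illustration, here's a minimal sketch of what an XML-tagged prompt like that might look like; the tag names and content are made up for the example:

```ts
// Illustrative system prompt that uses XML-style tags to mark clear start/end
// boundaries for each block of context. Tag names are arbitrary examples.
const retrievedDocs = ["Doc 1 text...", "Doc 2 text..."];

const systemPrompt = `
<role>
You are a support assistant for an internal tool.
</role>

<instructions>
- Answer only from the material inside <context>.
- If the answer is not there, say you don't know.
</instructions>

<context>
${retrievedDocs.join("\n\n")}
</context>
`;
```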

14:48

but I'm curious if anyone else is doing that as well and so there's a whole bunch of things that we can kind of glean from that system prompt really interesting stuff but I'm curious are you all reading these system prompts you know I know there's some others like whether it's like Cursor's prompt has

15:08

been leaked you know arguably there's been other like agents that have had prompts that have either been released you know like I think bolt.new publishes either part or all of their prompts uh but I think there's just a ton of information you can gain from reading the prompts of either the models or some

15:24

of these agents that you know do kind of release how they're doing things because we're all learning this right this is all changing quite frequently so being able to dive into how others are structuring prompts I think is very useful and so Steve says he was just reviewing the client prompts today very cool Steve you should share if you have a link please share

15:51

the link to the client prompt so if anyone else is on X following this they can they can also review the client prompts because again I do think we learn quite a bit unfortunately I like many of you like I don't read all this stuff i physically can't and I don't think anyone can so often I'm using AI to summarize some of these things or I'm

16:08

you know reading summaries or in this case highlights of the prompt right and I think we all probably have to do that right there's so much information out there that we can't possibly keep up with everything but now I figured we would spend a little time and go through and actually test out OpenAI Codex so I have kind of a

16:33

very simple app that's just a very simple generative AI app i'll show you how it works Abhi built it you know on a live stream with me a few weeks ago at this point so it was you know it was a workshop that we hosted at Mastra it's just a flashcard generator app so it's very simple but I wanted to start with something really simple and then try to ask Codex to do some very simple things

16:59

see if it would generate PRs for me or what we can get it to do again I haven't actually used it so I don't even know all of its capabilities other than what some members of the Mastra team have told me they're using it for and what I've read online probably much like you so very good chance I'm going to struggle here but I think that's part of

17:19

the fun because if you haven't used it you can kind of struggle alongside me and hopefully uh if you do decide it's worth using you'll have uh much more information and you'll be able to get started much more smoothly ideally they make it really easy right and I shouldn't uh I shouldn't struggle but we'll see and all right I'm going to go ahead

17:39

and get it pulled up and I'll share the app and we will kind of talk through how it all works and we will see what we can get uh Codex to do but as I mentioned earlier uh before we jump into Codex let's just test out Operator i mentioned during the news segment that Operator is now using o3 under the hood supposedly rather than

18:05

GPT-4o or a custom version of GPT-4o so let's test out Operator see how it works and if you have anything you want me to test in Operator if you have any requests please uh drop them in the comments on either LinkedIn X YouTube wherever you're watching this and with that let's start by just looking at

18:33

Operator okay so here we go let's make this just a touch bigger so it has some examples of things that it can do dining and events delivery local services shopping help me order a rustic farmhouse kitchen sign that can be customized with her family name okay yeah let's do that why not who doesn't like Etsy you

19:10

know let's just see what it does so this is going to spin up do you have a price range so it's going to ask me a followup are you in the US no price range in the US currently in Sioux Falls South Dakota prefer a medium to large sign with my last name which is Thomas all right let's just see what this can

19:57

do so it's going out searching Etsy it's showing me what it's doing so we can kind of watch let's Let's see if we can make this bigger all right so I don't see anything now once I made this bigger so you're not missing anything i don't know what's going on over here but I'm not seeing it so it says it's live but I don't actually see sure we will

20:25

allow allow notification that's fine but I'm not seeing anything so so far just by expanding the screen something doesn't seem to quite be working but attempting to navigate back quickly clicking the arrow checking settings waiting it's telling me it's doing stuff but I am not able to see it and making it smaller doesn't seem to fix it interesting

21:00

so overall doesn't seem to be working that well so I will say while I am overall extremely optimistic about what AI agents can do and what these tools can do if you're like me you're no longer surprised when things don't work the first time you know this stuff is still like pretty new and it's non-deterministic and everything's moving pretty

21:33

quickly so I'm no longer surprised when something that demos really well doesn't seem to work for me i'm I'm sure you all have that same realization right and the idea is that at some point you know going from 50% accuracy to 70% accuracy feels really good and then 70 to 90% accuracy feels great and then it's still going to take even longer to get to 95 or 99% accuracy on all these things all

21:59

right I'm going to see if I can take control says I have control i can't do anything next i'm in okay now I I'm finished up i didn't do anything but let's see if it I took control yeah I still don't understand why I'm not able to see anything so it's clearly doing things in this browser and when I take control I can see it but unfortunately when I'm

22:32

not in control I can't see it wonder if I refresh see if that really breaks it was my last request completed successfully gee I don't know maybe i can't actually tell all right so it did give me some options so it clearly went through and gives me some links that's pretty cool i guess you know not really my style i'm not really

23:15

gonna buy a kitchen sign but it does uh it did actually seemingly do the right things however for whatever reason it wasn't actually showing me the browser once I maximized it probably some kind of bug maybe I did something wrong you can see here it now shows me I can see the browser there's all the tabs are opened it did its

23:38

search it found the you know it found the right listings so it did work uh why I couldn't see it while it was doing it i don't know i expect that was something that if I tried it again most of the time it would work or some of the time it would work but clearly not that time but anyways so that's operator you

23:58

can see it kind of opens a browser can do things on your behalf the most common things I've seen you know people try is like having it buy something for you having it book a table for you i do really think that while the browser is you know it's what humans use i do think that over time more and more things will just be through APIs right why do we need the

24:23

browser that being said there's always going to be things that are either slow to adapt or you know are kind of you know just like if you go to government websites they're often not very good right or if you go to like some banking websites they're not very good so there's all these like types of things that you may need to interact with that

24:43

are just going to be available through a browser probably for quite some time but I do expect that that becomes less and less because ultimately if Etsy has an API or if Etsy would create an MCP server you wouldn't necessarily need some kind of browser agent which is taking a bunch of screenshots and navigating a mouse but for now and in

25:03

the uh in the near future I do anticipate that more and more use cases will be found but I'm curious how many people are actually using this for you know besides demos right now are people actually automating workflows i've seen some that are starting to use it internally to automate tasks where you might you know not quite be able to script everything but ultimately if you

25:28

can script it it's going to be much more reliable than having an agent do it and actually when we had Eric um from Pig he's building muscle memory right it's kind of this idea around if you can script it you should but if you can't then it could fall back to an agent and then as the agent goes through and figures it out it kind of

25:47

scripts itself so that it's almost like self-healing in a way and so I do think that he's kind of ahead of the curve on that and while we're here why not just pull it up and show it because I do think that there's something to be said about that kind of pattern so let me pull up uh Muscle Mem as it's called and while we're here let's go

26:22

ahead and give Muscle Mem a star while we're here and while you're on the GitHub website feel free to go find mastra-ai and give us a star as well but this is kind of this idea of it's a cache so you can kind of replay complex behaviors rather than you know you let the agent determine it but then you can kind of like almost script it and replay it almost like a cache
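
For illustration, here's a rough sketch of that script-it-if-you-can, fall-back-to-the-agent caching pattern; this is not Muscle Mem's actual API, just the shape of the idea:

```ts
// Illustrative cache-then-agent pattern: replay a recorded script of browser
// steps if we have one, otherwise let the agent figure it out and record what
// it did so the next run can be deterministic.
type Step = { action: "click" | "type"; selector: string; value?: string };

const scriptCache = new Map<string, Step[]>();

async function runTask(
  task: string,
  replay: (steps: Step[]) => Promise<void>, // deterministic executor
  agent: (task: string) => Promise<Step[]>, // agent fallback that reports the steps it took
): Promise<void> {
  const cached = scriptCache.get(task);
  if (cached) {
    try {
      await replay(cached); // fast, deterministic path
      return;
    } catch {
      scriptCache.delete(task); // selectors drifted, fall through to the agent
    }
  }
  const steps = await agent(task); // slow, non-deterministic path
  scriptCache.set(task, steps); // "self-healing": next run replays this script
}
```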

26:48

and I think there's going to be different approaches to how this works but this is uh a good first step and I know there's others i think the Browser Use guys have um Workflow Use and they're kind of coming up with some of the same ideas so there's going to be a lot of interest in this kind of space I think over the next

27:06

year all right so now let's see if we can get Codex to build something for us so we have this app very simple flashcard app let's get started let's enter a topic let's create a topic on AI engineering let's create beginner flashcards and let's just do five to keep it nice and simple and let's generate some flashcards we

27:34

should have some better UI for actually like what it's doing but under the hood this is all this code is open source i'll pull it up here in a second but it is just making a simple call to OpenAI to generate some some flashcards so pretty simple what is AI engineering there is yeah a definition of AI engineering what are some common applications

27:58

virtual assistants recommendation systems what is machine learning in the context of AI engineering it's a subset of AI involves training algorithms okay so you can see we have flashcards that were generated really simple application but let's go ahead and do some things here to make it easier so I think you know if you give

28:26

people an empty prompt they might not necessarily know what to put in here so maybe we could give them some like cards down below or suggestions they could click on that would just fill in the topic rather than just making them type so maybe between topic and difficulty level what if we had some cards that seems like a pretty simple request i

28:46

feel like we should be able to ask Codex to generate something and maybe we can ask it to randomly uh select from a list of topics so generate a list of random topics randomize which ones show up here so if I refresh the page I'll get different options so that's what we're going to try to do

29:09

so I did set up Codex and the only thing I've done so far is you basically have to go through an onboarding experience i set it up so it had to connect to GitHub i connected it to this repo which I will share so the repo is right here if you want to see how this application works it's all here in this uh flash genius repo it's public so it's in the

29:43

mastra-ai org it does give you like some basic tasks that you can basically click and for every repo you connect it seems to just offer some things so the first task is explain the codebase and structure for newcomers so we can read it and this will tell us a little bit about this codebase here

30:06

so it's a React web app uses AI it lists here's the tech stack here's the kind of top level structure there's a Mastra agent and workflow which is good that's correct there's a flashcard agent there's a generate flashcards workflow there's a client that uses the Mastra client and calls generate flashcards talks about how routing works so

30:34

it does give you know what we pretty much expect AI to be able to do gives a pretty good summary it does tell you how to install or how to get started so you got to launch the Mastra server with npm run mastra dev and then launch the Vite dev server with npm run dev so all in all pretty good i would give that you know that's an A
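
To give a rough idea of the structure Codex described, here's a minimal sketch of a Mastra flashcard agent and a call to it; the names and instructions are illustrative, not the actual FlashGenius source:

```ts
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

// Illustrative flashcard agent in the style of the repo's Mastra setup.
// The real project also wraps this in a generate-flashcards workflow and
// calls it from the React client via the Mastra client SDK.
const flashcardAgent = new Agent({
  name: "flashcard-agent",
  instructions:
    "Generate study flashcards as question/answer pairs for the given topic and difficulty.",
  model: openai("gpt-4o"),
});

const result = await flashcardAgent.generate(
  "Create 5 beginner flashcards about AI engineering.",
);
console.log(result.text);
```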

30:58

that's kind of a softball in my opinion but it worked so now this was just a generated uh task i have no idea what kind of bug it was looking at but let's just see it says pick a part of the codebase that seems important and find and fix a bug very uh general so it replaced the medium difficulty option with intermediate so it matches the rest of the app and backend

31:23

expectations ensure the topics selector uses intermediate as the default and available option for difficulty fixing an inconsistent label and request issue okay so rather than medium let's change it to intermediate so I don't know that that really is going to have that much of an impact because that value I think just

31:51

gets passed into the agent but we should probably just look i'm curious how we handle this uh let's see so we call in pages in create there's and I don't know how maybe I should make this a little bit bigger uh in create there's this generate flashcards which is in the client so let's just go here it creates a new client and difficulty so it just kind of

32:31

passes I think it just passes that string on flashcards about whatever at difficulty level so it's Yeah so again I don't think this really is a bug it found something it fixed it okay i could create a pull request let's just see what that does so create a pull request for that we can review it later decide

32:52

if we want it find issues and propose fixes so let's see what it did here there's no AGENTS.md file repository lacks a testing framework typo in difficulty option okay so it found the same thing it already fixed so it can suggest it looks like in Codex it can suggest additional tasks which I could uh just go ahead and

33:23

start executing so let's see what it says here bug in entry script path so I don't know if this is actually a bug because the app definitely starts correctly so I'm not exactly uh sure why it thinks it's a bug but I imagine it doesn't seem to be correct because the app does run i can run it locally uh I'll just actually fire it up i do have

33:51

it up nope I don't so I do have it running locally you can see here so that doesn't seem like a bug to me but okay clarify flashcard properties and agent instructions missing tests sure this seems like a good one let's try let's see if it can generate some tests we'll see how that works and while

34:19

it's doing that you can see it triggered this let's describe another just another task okay so let's structure this by saying on the create flashcard page we want to provide users selectable options for the topic we should still have a text field however below the text field let's add a list of cards or links

35:02

that can be clicked that will automatically fill in the text field the goal is to allow a user to not have to think about the topic and select one from a randomly generated list now here's where we could just let it go i kind of want to be a little more specific please create a list of randomly generated topics and then only display a

35:38

a small subset of those on the page this way if the user refreshes they see new topic options to select from all right so let's go ahead and just say code this thing so this should fire off another task I think and we will see what it does so this thing's running it's reading the README you can kind of tell it tells you what it's doing so the

36:11

interesting thing about this that's kind of cool is that you can fire off a whole bunch of tasks and see what it does and get a bunch of PRs so if we go let's see if we got a PR here we do so fix difficulty option mismatch so this is pretty cool especially for those types of bugs that are you know I guess maybe more straightforward or simple and you

36:35

know that it probably can do a good job i mean the challenge with this is I would still need to it'd be no different than me reviewing a teammate's PR i suppose that if I wanted to actually test it I would need to you know check out this branch run the code test it but you know there is something about I'll

36:56

candidly say when I review PRs I don't always test the PRs if I if it's from someone on the team and I know that you know I can understand what they're doing I trust that they've tested the PR right um obviously some teams are different some teams you kind of require every PR should be tested by a second person some teams are a little bit more yolo and if

37:16

the code looks good you ship it but this is of course a little concerning because it's never been tested right at least not by a human and maybe that's fine eventually but I don't quite have the level of trust i guess we kind of saw you know Operator didn't work the first time or perfectly i kind of expect that these

37:34

models don't always work it kind of pointed out a bug that I don't think's actually a bug so again this code I'm pretty sure like I'm pretty confident I can merge and I know what it's going to do but ultimately I'd have to test this thing right so let's see where we're at with these uh these tasks we're going to let this thing run for another minute or

38:02

so but we can probably click into one and I suppose I can make this just a touch bigger looks like it's committing changes checking git status i am kind of curious what do you mean like committing hopefully not to I'm assuming just like a branch but so it added a topics file containing a wide range of sample topics the topic

38:28

selector component now picks five random topics from this list and displays them clicking the suggestion fills in the field automatically programmatic checks fail because dependencies were unavailable in the environment introduced a topic bank with numerous preset options and so the environment doesn't have network access after setup so it couldn't run certain commands
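
Here's roughly the kind of logic being described, as an illustrative sketch rather than the actual diff Codex produced:

```ts
// Illustrative version of the suggestion feature: keep a larger bank of preset
// topics and show a small random subset, so a page refresh surfaces different
// suggestions to click.
const TOPIC_BANK = [
  "AI engineering",
  "Basics of investing",
  "World War II",
  "Photosynthesis",
  "TypeScript generics",
  "The French Revolution",
  // ...a few dozen more in the real file
];

function pickRandomTopics(count = 5): string[] {
  const shuffled = [...TOPIC_BANK].sort(() => Math.random() - 0.5); // quick-and-dirty shuffle
  return shuffled.slice(0, count);
}

// In the React component, clicking a suggestion just fills the text field:
// <button onClick={() => setTopic(suggestion)}>{suggestion}</button>
```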

38:52

i don't know what commands it's trying to run though oh they run the type check okay well that's interesting but we can go ahead and I guess let's create a pull request after this so the cool thing is and I need to test this but this is all set up with Vercel so I should get a deploy preview for this and so I should be able to actually test this thing before I merge it

39:20

right there's some interesting topics i might uh want to change that but ultimately it wrote the code we can check the pull request and yeah and let's I'll go ahead and check while we're letting this other one run see if that one's done it also set up some tests it added Vitest here some mocks of course I

39:55

would review this but let's just create you know YOLO let's create this pull request and then I can check this one too looks like it's still creating and I'm going to see if I can get the see if the deploy preview is working and we're going to test this out oh I got to log in all right let's see

40:35

here one moment let me get logged in all right we're in all right so let's do it go ahead and give this a test you can see the suggested topics are here sure i want to learn the basics of investing i want to do some beginner i'm very much a beginner let's do number of cards five all right let's see if this

41:04

works moment of truth and then we do have a guest we will switch over to so for those of you that are just joining us we've talked about some AI news we are trying out Codex and seeing if it can replace me and actually make it look like I'm submitting PRs to the Mastra codebase but we're starting with just a flashcard uh app that Abhi and I built a while back so far it seems pretty good

41:31

all right it looks like we have some questions were generated so it worked right so that again very simple feature but I would say it basically one shot like zero shot did this thing exactly as I wanted it to be you click on it it adds the topic i can customize it it did exactly what I asked for so that's pretty cool

41:51

next time we'll maybe try we'll give it we'll throw a few curve balls and I imagine we can get it to break but thanks for joining us and now let's go ahead and bring on our guest today Anirudh from Stagehand so I'm going to bring you on and we're going to chat a little bit about Stagehand and probably

42:15

some Browserbase hey Shane can you hear me i can hear you well oh yeah nice to thanks for having me here yeah welcome to AI Agents Hour good to chat with you i mean I've interacted with you many times online but uh first time we're actually chatting yes sir yeah I think we've been like uh we have shared Slack and big fans of Mastra over here um I think just in general right like the TypeScript AI

42:39

community is relatively small uh it's really nice to see other people like you guys uh pushing the envelope forward small but growing we like to say right yes sir uh yeah I mean maybe can you give a quick background on you and then we can talk of course talk Stagehand and Browserbase or wherever else you want to go with it yeah for sure uh I'm

42:59

Ani um we work on uh this framework called Stagehand here at Browserbase um our take on it is it's basically just a um AI-enabled Playwright so uh I think a lot of people want browser agents uh it's kind of a huge buzzword now um but really what we wanted to build was something where you could kind of go the full agentic route um but what we're seeing

43:26

is you know in the age of AI uh humans are becoming liability shields right so what that means is basically you kind of want AI to take over and you know oneshot an entire workflow but a lot of times you actually need a lot more guardrails to make sure that the AI is actually doing what you want it to do

43:44

but being fully deterministic and writing browser automations using like traditional Selenium or even more modern tools like Playwright is still really really cumbersome uh super brittle and so that's why we wanted to just achieve like a happy middle ground with Stagehand where you know if you want to go fully deterministic with just DOM

44:01

selectors everywhere and like you can only do this specific thing you can do that with Stagehand and if you want to go fully agentic um you know leverage frontier CUA models um from OpenAI and Anthropic um you could also do that as well so kind of just however you want to interact in the browser I think Stagehand

44:18

kind of has a solution for you yeah I think that's you know we were just actually demoing Operator before this yeah and you know it didn't completely work but it kind of worked uh but I do think it you know and we have some we talked to someone from Pig which is a YC company and they're kind of building Muscle Mem which is this idea

44:40

of like this caching layer so because exactly what you said sometimes more deterministic is is better right if you can get the the DOM selectors and you know HTML doesn't change that often so you can actually know that it's going to work correctly almost every time or much much higher uh percentage of the time versus if you just give it to an agent and let them determine then you're more

45:02

definitely have even more issues with network load and and all that stuff like maybe it's going a little slow and you don't wait long enough and and all that and also just much faster right if you can use the the DOM selectors versus you know telling a mouse to move and click a button yeah I think like something we noticed too is like um with agents right uh a

45:24

lot of times with one sentence agents are really good now at accomplishing a specific task um even like 80 90% of the time uh but sometimes that like that 10 that 10% you will encounter if you use it frequently enough uh and a lot of times you're like okay like for this one specific scenario I'm just going to add this extra sentence to this agent prompt and it's going to figure it out and then

45:47

it runs into like another issue right and you're like "Okay I'm gonna add another sentence" and very soon you end up with like paragraphs and at some point you know I think with vibe coding tools and everything code isn't as scary anymore and so I'd rather just explicitly write loops and conditionals instead of prompting it in like a really uh

46:06

you know spaghetti way and and the problem is what you sometimes run into is you fix it for one case and then you maybe broke it for another case that was working before it's very difficult to like know because you you're not testing every case it's you know even in software it's like test coverage you can at least know there's 100% theoretically

46:24

of test coverage there is no 100% in prompts right of like what someone could potentially do so there's no way to really test for everything and this is you know candidly in Mastra it's similar right we have this idea of agents where you can give it a prompt and you can just let it do its thing give it some tools or you can use

46:44

workflows which are much more descriptive and deterministic and it's that same idea is that if you can make it a workflow you probably just should especially if you can just use an agent to kind of write that workflow for you or help you write the code and now you know that with much more reliability it's going to it's going to do the

47:00

things it says it's going to do 100% yeah I think it's interesting right because like there's no like linting or type checking or anything like that for prompts and so like spaghetti prompting is actually a lot more dangerous than spaghetti code right because at least spaghetti code you can you know throw that in an LLM and get it to refine or

47:18

you know make sure there's like certain rules there that you know you have brackets that close right but you can repeat the same sentence five times in a prompt and how are you going to catch that with a linter it's much easier to you know if you need to console log spaghetti code you can't really console log a prompt very easily of

47:34

course you can kind of get it to try to explain itself and all that but that's just the LLM generating a response of what it thinks it did right so yeah I would agree i never really heard spaghetti prompting but I think that's a good term i just uh came up with that on the spot actually all right hey everybody spaghetti prompting

47:53

let's make that a thing um yeah but so what else so what do you got on the I guess what's next with Stagehand what are some things you're working on that you know I guess how big is the team that works on Stagehand i know you know Browserbase obviously does more than just Stagehand but you at least have part of the team that's focused on working on Stagehand right for sure yeah we have a team of um

48:18

four engineers including me uh actively hiring more i think uh we're still growing uh yeah I believe we have a couple more people joining soon hopefully but uh yeah four engineers right now uh dedicated to Stagehand um the team is going good i think like what we're working on right now um a couple weeks ago uh we put out evals for

48:40

different LLMs it's something I've been wanting to work on since like November honestly and I just thought it would be like a fun thing to just say we did and it actually got quite a bit of attention um more than we anticipated at first but I think it's an interesting problem because because of our paradigm of you know this like we have three simple

48:58

tools we have act extract and observe so for people who are unfamiliar with Stagehand um at its very basic core uh there's like three primitives for interacting with a browser so we have act which is like do something on the page uh like click this button uh fill out you know this text field or something uh extract which reads from

49:20

the page and conforms it to an expected schema and then we have observe so again with agent reliability uh if you do like page.act("click this button") uh you want to make sure that the button is actually the button and so we have like page.observe which is kind of like a planning step and so basically when you say like page.observe("click the sign-in button") uh you can get a JSON

49:46

response that says okay like the action is going to be click uh the XPath is going to be this and so you know exactly what it's going to do and then you know you could confirm it or cache it uh to avoid LLM inference hits later uh etc
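
For reference, a short sketch of what those three primitives look like in code, based on the Stagehand docs; treat the exact options and URLs as approximate:

```ts
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "LOCAL" }); // or "BROWSERBASE"
await stagehand.init();
const page = stagehand.page;

await page.goto("https://docs.stagehand.dev");

// act: do something on the page, described in natural language
await page.act("click the quick start button");

// extract: read from the page and conform it to an expected schema
const { title } = await page.extract({
  instruction: "extract the title of the current page",
  schema: z.object({ title: z.string() }),
});

// observe: planning step — returns the concrete action (selector + method)
// as structured data without executing it, so you can confirm or cache it
const [signIn] = await page.observe("click the sign-in button");
await page.act(signIn); // replay the observed action without another LLM inference

await stagehand.close();
```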

50:07

and kind of what you get with that right it's basically just a GPT wrapper right uh it depends like what we send to the LLM but in essence what we're sending is just like the DOM and the accessibility tree and we're getting you know a structured output back of what to do and what to do it on and so what's interesting about that is basically it's a binary accuracy result right if you say click the sign-in button it either gives you the payload that says click the sign-in button in

50:34

JSON form or it's wrong right uh it's not like an agent where you know you could say sign into this form and there's like so many different ways it could do it it could like fill in the password first and then the username and then click sign in or it could you know type the username type the password and then hit enter um it could type something wrong self-correct is that a

50:53

good thing or a bad thing um so you know with Stagehand there's like very little room for error uh it's just it did it or it didn't and so with that it's like really nice because you can actually basically just effectively unit test an agent uh because at each step you could say did it do the thing or did it not and so it really becomes an eval

51:16

of how good are different LLMs at analyzing really really large nested hairy structured data uh like a DOM or an accessibility tree and turns out you know Gemini is really really really good at this um so they have like a really long context window i think they built up a reputation for a while as being like a really good needle in the haystack LLM yeah so that's and this is

51:41

what this is what we're talking about right am I yes sir all right so yeah let's dig into it for sure uh I think candidly uh the DeepSeek one is a little off i think it's because we ran into rate limit issues and so a lot of them just ended up erroring so that's a misnomer Gemini is still the fastest um

52:01

yeah I think that there's like a slight uh small issue there but um yeah generally you can see like if you sort by accuracy Gemini 2.0 Flash is pretty up there and also again if you sort by cost or speed Gemini is like in the top three which is pretty amazing uh given that it's the best the cheapest and the fastest i think it's really really rare that you

52:25

can hit a holy trinity like that uh especially you know like given that claude is 40 times more expensive and slightly less accurate um really really huge W for Gemini there uh but and so and so what and now you kind of mentioned it but can you talk through what how are you making this judgment i know you have this link here which I I

52:47

could presumably click on and you know go through and read the whole post but when you define accuracy like can you talk about the tasks that you're having it do and how do you judge if it's accurate or not yeah for sure actually if you click on that link um in the uh

53:06

Yeah the word here all right yeah are you not able to see it there we go and then if you scroll down um to I think the bottom uh it should say like view it on Hex uh yeah if you click Hex um and then if you click the Gemini run or the OpenAI run actually uh you want the OpenAI run yeah either one yeah so this is like our um one set of

53:43

eval uh on Braintrust and Yeah and it's taking some time to load yeah I can fill you in uh while it loads but basically every one of these evals um basically just sends this payload to the LLM of hey here's the DOM and this accessibility tree it's a custom data structure that we built to represent um a website and its contents and given those contents

54:09

can we say click the quick start button and then resolve this DOM to the X path of the quick start button the method being click uh and a description saying we're going to click the quick start button and that way you can cache that and then the next time you say click the quick start button it can pull from the

54:27

cache to avoid an LLM inference and if the XPath doesn't resolve for example you can then re-hit the LLM so basically all these evals are just one or zero of did it resolve to the right expected um response or did it not and so what you see there is like you know um on extract tasks for example it can be like 90% accurate um on act it can be 80%
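
A tiny sketch of that one-or-zero scoring idea; the field names are illustrative, not Stagehand's internal eval schema:

```ts
// Sketch of the binary scoring being described: the model's structured output
// either resolves to the expected action or it doesn't — no partial credit.
interface ObservedAction {
  method: string; // e.g. "click"
  xpath: string;  // the element the model resolved the instruction to
}

function scoreObserveTask(expected: ObservedAction, actual: ObservedAction): 0 | 1 {
  return expected.method === actual.method && expected.xpath === actual.xpath ? 1 : 0;
}

// Accuracy over a suite is then just the mean of these ones and zeros.
const scores: (0 | 1)[] = [1, 1, 0, 1];
const accuracy = scores.reduce<number>((sum, s) => sum + s, 0) / scores.length; // 0.75
```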

54:52

but it's very binary one or zero um which makes it quite rigorous for an eval set that's otherwise you know very subjective yeah yeah i mean it's definitely uh yeah it's nice to have the data right so people can dig around and see okay how are you judging this you know I imagine you know of course model

55:17

providers always want to hit benchmarks and this is at least an idea of okay it's it's kind of a benchmark in a way right of like accuracy yeah I think it's like a it's interesting like it's not really a benchmark because really it's it's a we use it to test internally against regressions right so if we say that we're improving act

55:37

performance how do we know that we're actually improving act performance theoretically we don't know that we're improving it what we have to do is basically ensure that we're not breaking performance and then add a new eval to test the new feature yeah yeah ultimately yeah so you you want to make sure you don't have regressions and then you know ideally you continue to improve

55:56

your eval data set over time and add more you know more and more uh you know tests basically along the way to make sure that you're continuing to make it better exactly and I think it's actually more like regular software engineering as opposed to AI engineering right because benchmarks you actually want those to be as low as possible so that they can you know increase over time as

56:14

models get better but this is more like you want this to be as close to 100% at all times uh so yeah I think this is like a really interesting visual here is that you know this is actually a surprise for us like we just defaulted to GPT-4o for vibes you know it's just like uh you know everyone just uses Anthropic and then in building

56:35

this we were like oh my god Gemini is actually really really good and it's super cheap so why are we not using that as the default yeah yeah that's very interesting i mean I don't see uh Claude 4 on here i imagine that's coming actually so that's the interesting thing right so um Claude 3.5 and Claude 3.7 are 40 times the price of Gemini and less

56:57

accurate and so Claude 4 is relatively similar right i think that when you're paying an order of magnitude less for Gemini and getting better results it actually just doesn't make sense to even try Claude 4 because it's just a different um use case right like in what case would you recommend getting paying again 40 times more and even if it's 92%

57:20

more accurate or 92% accurate you know it's not enough to convince me to switch off of Gemini but what is interesting though is that you know I think the nice thing about Stagehand is it's just tool execution so once you know hey I want to click the sign-in button you can use Stagehand to click

57:39

the sign-in button but where Claude is actually really good is determining what to do right so we talked about like you know Stagehand being this thing of the agent can only do this but the agent has to know what to do as well and I think that you know with MCP and everything we're seeing that if you hook up these Stagehand tools to an agent like Claude, Claude is really really good at accomplishing long-running agentic tasks
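
As a sketch of what hooking Stagehand tools up to a planning model could look like with the AI SDK; the tool wiring, prompt, and model id here are illustrative assumptions, not code from Stagehand or the episode:

```ts
import { generateText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();
const page = stagehand.page;

// Expose a Stagehand action as a tool: the planning model decides *what* to do,
// Stagehand (with whatever model it's configured to use) handles *how* to do it.
const browserAct = tool({
  description: "Perform one action on the current web page, described in plain English",
  parameters: z.object({ instruction: z.string() }),
  execute: async ({ instruction }) => {
    await page.act(instruction);
    return `done: ${instruction}`;
  },
});

const { text } = await generateText({
  model: anthropic("claude-opus-4-20250514"), // model id may differ
  prompt:
    "Research what the Cavs should do this offseason, then set up a plausible trade on the NBA trade machine.",
  tools: { browserAct },
  maxSteps: 15, // let the model call the browser tool repeatedly
});
console.log(text);
```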

58:00

so when Claude 4 came out actually uh I did like some fun experiments uh I follow basketball i'm a huge Cavs fan and I just you know asked Claude 4 Opus hey do your research on what you think the Cavs should do in the offseason and then make a potential

58:24

trade on the NBA trade machine and it did it it was like a valid trade it was one that made sense uh and the fact that it was able to call the Stagehand tools appropriately uh I think that's where like agents are headed right in terms of knowing what to do and reasoning about an arbitrary situation as opposed to just you know a very

58:43

simple task of analyze this DOM and get the right element yeah that makes sense i mean I think as we continue you know as models get better being able to have it do increasingly complex tasks over time seems like the path we're on right i do still you know as I mentioned I joked earlier but also in seriousness uh you know things still break down pretty regularly

59:09

right i think we all experience that but if anyone that's probably watching this does see the you know you wouldn't be watching this if you didn't see the promise and where this stuff is going and I do think that that increased complexity of being able to give a more uh you know kind of general task and then have it actually go out and make a

59:28

plan execute the plan with the tools it has that that's where Yeah that's where the ball is going you know not to use the the the sports analogy since we're talking sports no 100% i think what's interesting too is like um I was at this event a couple months ago uh and this person came up to me and said uh like yo

59:46

yeah MCP is like the internet for agents and I was like okay relax right I think that it just felt like the most buzzworthy thing I'd ever heard but I think looking back I think you know hype aside I think there's some truth to that um especially given that you know the internet itself is just a conversation

1:00:03

between your computer and some other computer in us-east-1 um I think MCP you know sorry still sticking to the internet when you hit an API you're not just hitting like one monolithic API that gives you a response right you're actually hitting an API that hits its own APIs they hit their own APIs and they all just kind of collate together and give you one final response um and I

1:00:26

think that's what's really interesting about MCP uh and how we designed Stagehand as well where you know Claude when it's executing this plan it can just take a screenshot and just say "Okay given the screenshot I think we should probably click this button." And you don't need to pollute the context of that agent with the contents of the DOM

1:00:44

and figuring out how to do all that it can just kind of throw that at an isolated agent like a mini agent um so it becomes kind of like agent microservices in a way yeah agents all the way down yeah it's microservices all the way down right yeah uh but I think that's what's really interesting too it's like yeah agents talking to many

1:01:03

agents uh that can you know like for example with Claude Claude is the one determining what to do and it's offloading that to Gemini to actually do it uh so yeah the whole idea of like agents that talk to agents and like the the composition of different models talking to each other i think that's going to be pretty common because I do think that we're going to find there are certain models that do better at

1:01:27

different types of tasks right specifically you know like you might need image generation well there are certain models that are going to do that really well you might need you know writing code which maybe Claude does slightly better than others you know on writing code but obviously you know using Stagehand uh you know it seems like Gemini is much better at you know

1:01:47

navigating maybe the web and so there's yeah I do think and then it'll change over time right as models you know new models come out they'll kind of jump each other and so making it uh one agent that can kind of communicate through some kind of protocol which MCP at least is what seems to be leading the way right now with how that's going to work

1:02:06

um seems like the future that we're probably uh heading towards right 100% right like even like with MCP uh even outside the context of Stagehand right like you'd have like a docs MCP where you know you could ask Claude hey how do I do this using Mastra and you know Claude would probably offload that to a docs MCP which could then further specialize that into okay where's like the Mastra MCP within my

1:02:33

docs registries uh I saw you guys built an MCP registry registry that I thought was hilarious yeah yeah it was the joke that there are so many registries how are you going to keep track of them and then you know like it'll get easier over time but yeah we do have the Mastra MCP registry registry which we try to keep track of all the available MCP registries that are

1:02:52

popping up i think that's interesting too right like what does like DNS look like in this like agent web what does you know Chromium look like um like if you're talking about like internet for agents what are the internet building blocks uh in the context of AI and what are the the mechanisms for an agent to

1:03:10

potentially be able to look up MCPs and then you know get permission to use those you know organically rather than having to have some kind of defined system prompt and set of tools you know you know at build time or whatever you can basically do it at runtime and allow your agent to like go out and find its own tools and then you navigate and use

1:03:30

those as needed and then maybe get rid of them or if it doesn't need it you know anymore so yeah I think that that's again the it's hard it's difficult to see that because there's still a lot of a lot of ground we have to make in order to get there but I do see that as like the what's potentially coming in the next you know 12 months i think I think

1:03:49

we'll at least have really cool demos of it of things doing that and maybe that's even a long timeline i don't know i think it's interesting too like how different models are kind of contributing to vendor lock-in in different ways um they all have their own unique features like OpenAI for example their reasoning models are just a separate class right like GPT-4.1

1:04:08

just cannot do reasoning, versus with Claude they're like, oh, you can just throw in a reasoning step with Sonnet. Also, I think the most annoying thing for us personally is that Claude is the only one that supports multimodal tool output, especially in terms of MCP. So I mentioned earlier that Claude can be a

1:04:28

web agent, because it can take a screenshot and reason about what to do, but only Claude can do that, which is the weirdest thing. No other LLM provider can use a screenshot tool, which is just mind-boggling. It's come up a couple of times where users of Mastra are building tools that need to take a screenshot, and so they're using Claude

1:04:55

because it's the only one that really has multimodal support. But also, in Mastra we try to support any provider, and not every provider supports that specific feature, right? So it's frustrating that the other models don't support it, but I'm also grateful that Claude does, if that makes sense. Yeah
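For context, here is a minimal sketch of what "a screenshot as a tool result" looks like over MCP, using the MCP TypeScript SDK and Playwright; the server and tool names are made up for illustration. The tool returns an image content block rather than text, and whether the model on the other end can actually reason over that image is exactly the provider gap being described here.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { chromium } from "playwright";
import { z } from "zod";

const server = new McpServer({ name: "screenshot-demo", version: "0.0.1" });

server.tool(
  "take_screenshot",
  "Open a URL and return a screenshot as an image tool result",
  { url: z.string().url() },
  async ({ url }) => {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(url);
    const png = await page.screenshot();
    await browser.close();

    // The interesting part: the tool result is an image content block,
    // not text. Whether the model can reason over it is provider-dependent.
    return {
      content: [
        { type: "image" as const, data: png.toString("base64"), mimeType: "image/png" },
      ],
    };
  }
);

await server.connect(new StdioServerTransport());
```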

1:05:14

Yeah, I'm curious also, how do you guys think about incorporating different LLMs? For Stagehand, one of the biggest pieces of tech debt that we're getting around is that when we started, we said, okay, we're going to use our own LLM client implementations. So we didn't use the Vercel AI SDK or anything, we just used the straight OpenAI SDK and the

1:05:34

Anthropic SDK, and it was just the worst, because new LLMs came up, like Gemini for example. This is why we ignored Gemini for the longest time, because we had to write a separate LLM client implementation for it. Using the AI SDK has been really nice for us, but how are you guys thinking about LLMs as they

1:05:52

evolve, and incorporating all their different use cases into one nice abstraction? So we also started out thinking we were going to roll everything ourselves, and we did that, and then we also moved to the AI SDK now. So I think it's just the common way to do it. Like, I didn't know that you were using the AI SDK, for instance, and I bet you

1:06:12

might not have known that we were using the AI SDK, but it does make sense. And I think that's where, if we can agree that we're going to have this SDK that provides this kind of model routing, and we can build on top of that, then ideally we just contribute back if needed to help make sure that that mission is solved.
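As a concrete picture of what building on that shared model-routing layer looks like, here is a minimal AI SDK sketch: the same generateText call, with the provider swapped by one line. The model ids are illustrative, not recommendations from either team.

```typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";
import { google } from "@ai-sdk/google";

// Route different kinds of work to different providers without
// changing the call site. Model ids are illustrative.
const models = {
  code: anthropic("claude-sonnet-4-20250514"),
  browsing: google("gemini-2.0-flash"),
  general: openai("gpt-4.1"),
};

const { text } = await generateText({
  model: models.browsing,
  prompt: "Summarize what is on the page that was just captured.",
});

console.log(text);
```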

1:06:31

Obviously it helps Stagehand, it helps Mastra, it helps Vercel and their interests with the AI SDK and v0 or wherever else they might be using it. 100%, I think that's really cool too, right? For all of us in the TypeScript ecosystem, with everyone using the AI SDK as you mentioned, you can build an MCP server using Mastra,

1:06:52

throw in Stagehand tools using our MCP integration as well, and choose your LLM. So I think, yeah, it's really cool. In JavaScript land everyone kind of coalesces on one stack, like the MERN stack or something back in the day, and I think now you're seeing a similar thing where you use Mastra for your agent, the AI SDK for your

1:07:15

LLM client, hopefully Stagehand for your browser tools. So yeah, super bullish on the ecosystem. Yeah, and I think you're seeing more and more adoption and more really cool projects being launched in the TypeScript ecosystem, so excited to see it. I think we're kind of just at the beginning. We of course are kind of all in on TypeScript and believe that,

1:07:42

even though it made sense for Python to be the dominant language. And I don't think it's going away, there are going to be a lot of people using Python for AI for a very long time, right, especially when you get to the point where you need to get down to the model level and do more training and tasks such as that. But

1:08:00

for building applications there's no reason that it needs to be Python. And so we're seeing more and more people choose the tools that they know, and I do think other languages are slowly getting AI frameworks too, which makes sense, right? I don't think it needs to be specifically just Python or TypeScript, though I do think that TypeScript is going to continue to be

1:08:19

pretty prominent and growing over the next couple of years for sure. Yeah, I think TypeScript is like the application layer language, right, versus Python is more the analysis and the offline stuff. Yeah, we like to joke internally at Mastra that Python trains and TypeScript ships. Oh yeah,

1:08:39

so one of our engineers made a Studio Ghibli-style image of a python on a train and then a TypeScript ship. It's funny, I'll pull it up at some point. But this has been a lot of fun. Anything else that's going on with Stagehand that you want to share with the 150-ish people watching right now? Yeah, I

1:08:59

think what's really cool too is I'm trying to take evals one step forward. Something I did as a fun experiment: if you do npm create browser-app --example chess, you can actually just watch OpenAI and Anthropic go one-v-one against each other in chess. That was just for fun. We allow you to use any computer use agent in the browser, so I just wanted to

1:09:26

find a cool way to showcase that, right, and you can have agent one equals Anthropic, agent two equals OpenAI. It was really cool to see. Sorry, yeah, do you have a link where we can show them? You can send it in the private chat and I'll pull it up, or you can just share your screen. It'd be cool for those of you who want to see it.
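For a rough idea of what that "agent one versus agent two" setup can look like in code, here is a sketch against Stagehand's agent() API as it is documented around this time; the exact option names, model ids, and board URL are assumptions, so defer to the chess example in create-browser-app for the real implementation.

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Assumed API shape: stagehand.agent({ provider, model }) returning an
// object with execute(); see the create-browser-app chess example for
// the exact, current interface.
const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

const agentOne = stagehand.agent({
  provider: "anthropic",
  model: "claude-sonnet-4-20250514", // illustrative model id
});
const agentTwo = stagehand.agent({
  provider: "openai",
  model: "computer-use-preview", // illustrative model id
});

// Point both agents at the same board and let them alternate moves.
await stagehand.page.goto("https://lichess.org/"); // any playable board works

for (let turn = 0; turn < 10; turn++) {
  const agent = turn % 2 === 0 ? agentOne : agentTwo;
  await agent.execute("Look at the chess board and play the best legal move for the side to move.");
}

await stagehand.close();
```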

1:09:48

Let me see, is it create-browser-app? Yeah, it's npm create browser-app --example chess. I've got Connect 4, here's Connect 4 for example. Otherwise I'm feeling pretty brave, I can run this thing locally if you want me to and we can just try. Yeah, do you have a Browserbase account? I do, but I'd have to

1:10:19

probably log in, but I'll pull this up and we'll take a look at this. Yeah, I think the CUA agents can be a little iffy locally because of browser dimensions, but it works a lot better on Browserbase. Yeah, well, here we go, that's Connect 4, so that's not computer use, but I think what's generally really fun is just having agents play each

1:10:41

other in chess. Oh, there we go, I found the chess link. But yeah, it's really cool because you can see here how it's reasoning about what move to make next. And I think what's really cool about games, specifically in the context of agent evals, is that especially in the context of strategy there's often a right move to make. And so I'll leave you with this, right, I

1:11:07

think no one's going to reward hack and build an agent that's the best at Connect 4, but what we're trying to get at is: if you can judge at every move how good the move was relative to the best move it could have made, you can effectively judge an LLM based on how good it is at calling

1:11:25

tools, calling the right tools, and also reasoning about an arbitrary scenario. And so if you can present Connect 4, an arbitrary situation, to an LLM in a way the LLM can reason about and understand, and the LLM can actually do it well, then the problem just becomes communicating an arbitrary situation to an LLM.
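A generic sketch of that per-move scoring idea: score every move an agent makes against the best move available in that position, then aggregate over the game. The Engine interface here is a hypothetical stand-in; for chess it might wrap a real engine, for Connect 4 a solver, and, by analogy, for browser agents some task-specific oracle.

```typescript
// A minimal, game-agnostic "score every move" harness. The engine is
// injected, so this compiles on its own but does nothing game-specific.
interface Engine<State, Move> {
  legalMoves(state: State): Move[];
  apply(state: State, move: Move): State;
  evaluate(state: State): number; // higher = better for the player who just moved
}

// Score one decision: how far was the chosen move from the best available move?
function scoreMove<State, Move>(engine: Engine<State, Move>, state: State, chosen: Move): number {
  const best = Math.max(
    ...engine.legalMoves(state).map((m) => engine.evaluate(engine.apply(state, m)))
  );
  const actual = engine.evaluate(engine.apply(state, chosen));
  return actual - best; // 0 = optimal, more negative = worse
}

// Aggregate over a whole game to get a per-model number you can compare.
function scoreGame<State, Move>(engine: Engine<State, Move>, states: State[], moves: Move[]): number {
  const deltas = moves.map((move, i) => scoreMove(engine, states[i], move));
  return deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
}
```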

1:11:46

And if you can assign a score for how well an LLM reasons about this arbitrary situation, then you can expound on that, and instead of Connect 4 you could build it for websites or for browser agents. So yeah, very cool there. And I think this is probably the best place for people to follow you? Yes sir. So, check it out:

1:12:05

go give Stagehand a star on GitHub and check out Anirudh's X account, it's a good follow for all kinds of browser agent type content. Anything else that you want to say before we sign off? Likewise, just thank you so much for having me here, appreciate your content and the whole Mastra team's content, big fan of you guys, yeah, thanks again for having me. Yeah, of course, same when

1:12:33

it comes to Browserbase, definitely big fans. So I'll chat with you on our Slack at some point soon, I'm sure. Yes sir, I'll see you around. Yeah, we'll see you. All right, everyone, if you just joined: we talked a little bit about AI news, we talked a little bit about Claude 4 and reviewed some of the system prompt highlights, we talked about how Claude 4

1:12:56

could maybe, if you're using the Opus model, just call the authorities on you, which is fun, or interesting, or concerning. We did test out OpenAI's Operator, and because it is now using o3 under the hood, we tested that out and it kind of worked, even though we had some bugs. We did test out Codex,

1:13:16

and honestly I was pretty impressed with Codex. We gave it very simple tasks, of course, but for all the simple tasks I gave it, it did work. It did have some issues where it thought things were bugs that I don't think actually were bugs, I didn't look into it that much. But go back and watch the recording if you're curious how OpenAI Codex

1:13:37

works or what we got it to do. All the recordings are always available immediately after, we're live on YouTube, but the recordings are also available on YouTube after the fact. And then we talked with Anirudh from Stagehand about some of the evals that they're doing, some of the stuff that Stagehand is working on, and talked about

1:13:55

browser agents in general. So, appreciate you all joining us today. This is a pretty short episode, we kept it at an hour and 15 minutes, and lately it seems like we've been going two hours every time. But we're going to sign off now, and I imagine you might see Obby again in the EU time zone while he's

1:14:14

kind of doing his Europe tour, so you might get two AI Agents Hours some of these days. But for the most part we are Monday through Friday, right around noon Pacific time. We try to go for one to two hours every day, try to keep you up to date on what's happening with some news, we talk to some people from the Mastra team,

1:14:34

we highlight interesting uses of agents, and we bring on really awesome guests. So make sure to follow the Mastra AI account on Twitter or X if you're not already, you can follow me as well, and we'll see you next time. Goodbye.