
AI Agents Hour with Yujohn, Daniel, and Tyler

June 6, 2025

AI news and some pair programming with Yujohn, Daniel, and Tyler

Guests in this episode

Yujohn Nattrass

Mastra

Daniel Lew

Mastra

Tyler Barnes

Mastra

Episode Transcript

0:03

all right we're live i think we did it that's it i think we're live got a lot of uh abnormal faces this week not like their faces are abnormal but that they're not the ones normally uh hosting the stream yeah um yeah I guess uh it probably takes a while for people to join hey yeah yeah oh we already got four people watching got it

0:33

well people will be watching the rerun as well on YouTube so people watching from the very start will see all this uh useless garbage we talk about in the first 30 seconds yeah cool all right well today you got three monsters you got Daniel who you might have seen before, Yujohn, and I'm Tyler um we're going to start with AI news and

1:00

then maybe we'll talk a little bit about uh ourselves after that and then do some making moves we're gonna we're going to build some some stuff write some code oh yeah um yeah we're trying to think about what to talk before this and it just seems like there was so much news in the in the last day ai news i know there's a lot of news

1:26

outside of the AI world too uh that happened in the last day we're going to stick to AI though yeah we had a big long list we had to whittle it down to just the most important ones uh okay daniel did you want to share your screen or what do you think uh let me let me see where is the I know we put the we wanted yeah yeah go

1:54

for it if you're if you're ready for that you have the stuff pulled up give me a sec here Yujohn how's the weather in Seattle right now uh well it's looking sunny um sky is blue uh can't ask for more pardon me i said you can't ask for more yeah I know maybe I'll use the weather agent you know I got to do the uh good old pnpx create

2:35

mastra at latest i'll find out you get a free weather agent yep um yeah there's like there's like wildfires close to close to where I am not like super close but um it's so hazy here like yesterday and today and they're just like "Yeah don't go outside the air quality is like absolutely terrible welcome back Tyler." Um

3:12

okay i only have one monitor okay can you see my screen yep yeah cool okay all right so the first thing we're going to talk about is uh the new Gemini 2.5 Pro so this just got um released yesterday um and it was already the number one on a lot of these uh benchmarks um but it's even better now so it beat itself so on the uh on LMArena it

3:42

jumped up by 24 points um WebDev Arena it jumped up by 35 points so yeah maintaining its lead on the leaderboard at 1470 so and you know what i'm I'm all for it I've been using Gemini mainly to code and just seeing improvements makes me not want to switch yeah Gemini Pro is pretty sweet what's up could you quickly go over the LM

4:12

Arena stuff and sort of explain like what those numbers mean so yeah that's what I I just opened up LM Arena here can you still see the tab I opened yeah we can see it so I think LM Arena I'm actually not super familiar with this one i think let's see if we can see how this one works i believe people actually rank these but I might be wrong about that

4:41

yeah it's like it's voted by people actually using it on the site here prompt vote advanced AI oh oh so this isn't like benchmarks this is just kind of like anecdotal i think this one this one specifically but there's also WebDev Arena as well sorry Tyler could you actually zoom in your screen there yeah yeah for sure

5:08

pump it dump it uh maybe one more one more is that Yeah okay cool yeah let's bump this one up too so this is LM Arena here and we can see I mean this was the previous version which is now rank two the new version is rank one so a lot of people are voting this one up um but yeah that is pretty subjective
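For context on what those arena numbers mean: LMArena-style leaderboards fit Elo-style ratings (a Bradley-Terry model) from pairwise human votes, so the scores only encode head-to-head win rates. A minimal sketch of a classic Elo update, with an illustrative K-factor:

```typescript
// Classic Elo update from one pairwise vote. LMArena uses a related
// Bradley-Terry style fit; the K-factor of 32 here is illustrative.
function eloUpdate(
  ratingA: number,
  ratingB: number,
  aWins: boolean,
  k = 32,
): { a: number; b: number } {
  // Expected score of A given the current rating gap
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  const scoreA = aWins ? 1 : 0;
  const delta = k * (scoreA - expectedA);
  // Zero-sum: whatever A gains, B loses
  return { a: ratingA + delta, b: ratingB - delta };
}

// A ~1470-rated model beating a lower-rated one gains a few points
console.log(eloUpdate(1470, 1446, true));
```

So a score like 1470 just means the model wins head-to-head votes more often than lower-rated ones, which is why it reads as subjective in the discussion above.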

5:39

so let's let's look at um it's a it's a popularity contest and looks like Google's winning the belle of the ball yeah to be honest I'm still on Claude uh 3.7 on Windsurf you're you're living in the past you got the future WebDev Arena this one I'm also not familiar with it says that it's a real-time coding competition where models go head-to-head so that actually

6:11

sounds a bit more legit developed by LMArena right so I guess some people made it but it's apparently beating out Claude Opus 4 which is a very good model but I'm not I'm not surprised 2.5 Pro is amazing isn't Opus 4 like hella expensive too and pretty It's way more expensive than than Gemini that's for

6:35

sure yeah oh oh Inception yeah cool what else we got um Aider benchmark guess it's doing good on that maybe we don't need to look at every single one but I mean this is this is pretty cool i haven't tried it out yet daniel did you try this out today uh no because I I don't I mean in cursor you can't tell which one you're using i don't think you can select between the different ones oh okay well we'll have

7:10

to It just kind of says like Gemini 2.5 Pro so maybe they switched to the new one it's Gemini season watch out for all the Geminis out there oh yeah Abhi all right what what do we got next um oh we got a bunch of cursor news cursor's kind of taken over yeah you want to you want to tell us about the cursor news yeah I guess there's um they launched

7:44

1.0 and had a bunch of updates let's see if we can pull up I think they got a change log here let's see can I share my screen yeah and don't worry Abhi i'm bumping it up i'm bumping it up okay um oh you know what Abhi you being a Scorpio surprises nobody here i don't know what that means though yeah i don't know anything about Scorpios what's a Scorpio supposed

8:23

to be like like Abhi I guess yeah if you're like Abhi you're a Scorpio um yeah so we got cursor 1.0 uh oh they they released a uh MCP uh registry or directory that you can just where I think do they have a link in there maybe might have to scroll a bit yeah you'd hope so but it doesn't look like it oh no this is something else

9:02

could be further down i saw that they also had like Oh yeah that was Yeah it's a big release they got a lot of stuff there you go oh yeah you can add some curated servers oh there's more here than when we last looked oh I think we're seeing your other tab oh oh I only shared the tab not the screen okay let me let me fix that

9:29

let me just share the screen there yeah so they have this this here and I guess just one click install making it convenient showing you i don't know i feel like I've had a few different conversations with different people about how there are just so many garbage MCP servers so it's always nice to see a curated list of ones that were actually vetted and so yeah looking through this

10:01

you somebody's doing the vetting for you it's always a a good thing i uh I did submit the Mastra Docs MCP to that registry i have a feeling they they won't approve it but fingers crossed yeah so Mr Cursor if you're watching please approve our MCP server um and if any of you know Mr Cursor uh send him a message uh slide

10:27

into his DMs let him know that we're uh we're out here okay they shipped background agents too we were talking about this earlier today um a remote coding agent so I haven't tried this yet i haven't tried it either but I've been meaning to we were just playing with Codex a bit today um maybe it didn't

10:54

actually work we were running into all kinds of problems maybe we should try cursor background agents instead true yeah so these background agents what will they allow us to do it's like an async coding agent so normally you're in cursor and you're interacting with it and kind of like pair programming i believe with the background agents you tell it what to do and then it'll go away and just kind of

11:18

do it on its own and let you know when it's when it's ready so then I So you can have I guess multiple of these agents working simultaneously within cursor i think that's the idea yeah but it it is remote so I think it's maybe I don't know if you interact with it in cursor locally but I think it doesn't

11:35

run on your machine it runs on their server somewhere oh yeah remote coding agent it's kind of interesting how they how would they work together you know like within the same files like when one change might affect another agent's uh change that'd be kind of interesting it's like fighting over which agent's going to win but we were talking about it earlier because we're like "Oh how do we do like multiple?"

12:04

Because Tyler you showed us that you had like multiple versions of the the monorepo and so you would run your coding agent in these like multiple versions but then something like this um you also mentioned like git uh worktree as a way to like work on multiple branches so that kind of seems like I don't know maybe eventually they'll drop support

12:29

for something like that so you can work in in multiple branches in one cursor window so you don't necessarily like run into those conflicts conflicts yeah like different yeah you might be able to do it today i'm not really sure so like git worktree it basically checks out multiple branches at once on

12:49

your machine so you'll have like two directories each one's in a different branch and you can probably just add both of those as projects to the same cursor workspace it'll probably just work i think for the remote coding agent though like they're all in a different VM so they'd probably just be opening I

13:08

don't know if they open PRs i don't know haven't used their uh remote agent thing but Oh yeah i wonder yeah if you can get it to use like a certain branch already then that kind of like you don't need to worry about that at all it's got to be something like that because they just released that bug fix right in the beginning so the these background agents

13:27

will open up PRs which you might have another agent look at oh the Bugbot reviews its PRs this is we'll just sit here with some popcorn and just watch it all happen just click a few buttons yeah oh I guess this is their Bugbot it looks kind of similar like we use a company called Greptile for our PRs Greptile sweet I like Greptile yeah I love it
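The git worktree flow mentioned above can be sketched like this (the repo and branch names are made up for the example):

```shell
# Create a scratch repo to demonstrate (paths are illustrative)
git init demo && cd demo
git commit --allow-empty -m "init"
git branch feature-a

# Check out feature-a into a sibling directory as a second working tree,
# so two branches are on disk at once without two full clones
git worktree add ../demo-feature-a feature-a

# Both working trees show up here; each directory can be opened as its
# own project/workspace folder in an editor
git worktree list
```

Each directory is a normal checkout sharing one object store, so two coding agents can edit different branches without clobbering each other's files.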

13:56

um I kind of like like the idea of having a like a reviewer not from the same company that's doing the code just cuz I'm sure there are some like biases or just like ways that they write their coding agents that kind of like look for similar things maybe this is all in my head but I kind of like the idea of like

14:23

having an agent from this company an agent from that company and like working together yeah yeah I can see that right i mean I guess you could always have multiple uh agents reviewing the PR too they're like competing see which one does the better review kind of reminds me of when I uh you know if I try to use like Windsurf and create a PR I'm using Claude and then uh

14:49

push up the PR and then Greptile will catch something that it didn't catch i was like "Oh yeah that that's pretty cool." You know um so something really helpful when I first uh saw Greptile in PRs and stuff I was really skeptical i thought "Oh this is probably going to be annoying." So many GitHub bots are annoying but

15:08

it's the best one it's like it always has good feedback it's very rarely saying dumb things i was I was skeptical about it at first too but then yeah it pretty much every PR I have there's like at least one thing that is like very helpful yeah it's awesome and it looks like we got memories also in uh cursor kind of interesting

15:36

i think it looks like it's like per Oh yeah per project right there yeah manage settings oh interesting it lets you know when when it updates the memory and then Yeah so like uh for Mastra is this like the working memory uh would be an example of this kind of memory here that's being shown uh I think it is similar yeah um I think we

16:10

uh I was just looking at the chat saying "Yeah definitely." send some of these guys that caffeine energy i do need some caffeine that would be great um yeah I think working memory is kind of similar but it's not exactly the same thing we're going to be adding some some more memory features to Mastra very soon

16:35

we're kind of we've been kind of like laying the groundwork um to be able to ship a lot of memory stuff quicker so that's going to be happening soon we're going to probably have some some stuff like this you could build UIs around it and all that but yeah yeah we talked about MCP stuff yeah one click oh and it also handles OAuth too so it must

17:05

Um oh cool enable like the environment variables in cursor somewhere add them in that's sweet yeah I haven't actually tried the MCP thing in cursor yet like so now I want to before I just thought it was cool now I'm like now I need it it renders visualizations inside mermaid diagrams and markdown tables that's

17:34

pretty cool it's pretty small you probably can't see and we got a little little diagram there so what is it actually using the diagram for like explain i think it's just a way to like visually show to like the user okay like the the diagram like what it's about to do and stuff so like when you get to the new settings uh in the change log that that's when there's nothing cool after that

18:19

scroll past it no we got we got some other big news right oh yeah yeah they uh they announced that um like a funding round I think series C I believe yeah what was it here i got it i've got it on my uh they raised 900 million in series C funding that's a lot of money yeah what would you do with $900 million probably spend it on all the LLMs the

19:01

users are using api cost it's a lot of money yeah like how how do you spend $900 million there's a way to do it but I don't know yeah uh Mr Cursor if you're still watching uh let us know how you're going to spend that uh $900 million crazy too that they make 500 million in ARR i feel like there's no way that's

19:31

profit though i'm sure they're still in the negative after that yeah that's a lot of money though like that is a lot like when did Cursor come out um like it's been like a couple years at this point oh yeah it's been a couple years really i don't know yeah let's see or you know what let's ask good old Perplexity 2023 so yeah couple

20:07

years i just got a message from from Ashwin he says they're training a model that would uh make sense yeah oh with the with the 900 million yeah i mean I feel like at that point you kind of have to yeah okay i could see that i could see spending that that kind of money on training maybe not all of it but a lot of

20:40

money um all right what else we got we got a we got a big list here yeah I we can we can also move on unless you want to keep talking about news um give me one sec i think the last thing we were about we're going to talk about was the uh VO3 how someone in Brazil made that uh a commercial using V3 for like a city hall advertisement

21:10

yeah like 50 bucks yeah if you haven't seen it I'd recommend checking it out i was I was pretty impressed um you want to show us oh yeah just one moment here uh it seems like I need uh I got some permissions uh permission issue here okay oh no worries i I can uh share this why you had to leave that's why I had to leave they kicked me out and then I joined back it's asking me to leave

21:44

yeah okay okay so I mean it doesn't look totally real but it looks pretty good i mean for 50 bucks like making a whole advertisement is going to cost you a lot more than that it looks like a hundred times better than anything uh Shane created with Veo 3 and uh a hundred times less like creepy and unnerving i

22:12

don't know about that i mean Shane's video was pretty good um uh but it's kind of interesting like uh how they generate those uh images and like can you imagine yourself uh if you're a childhood actor you know like far in the future you see an ad with yourself in it and you're even played in it and they didn't tell you that would be pretty pretty yeah and child actors

22:41

have it pretty bad already just add another thing to the Yeah another cringe moment cool i mean we have more but yeah I think you're right Daniel let's do some coding oh shots fired Shane you're not here uh to defend yourself so this is this is when we can uh take some shots um yeah you wanna you want to jump right into it yeah sure we could give some

23:19

background on what we're going to do maybe show your blog post and I'll grab another link too okay yeah I was on the stream I think last week or something like that and I talked about uh a blog post that I did i'll just pull it up quickly just to lay some groundwork we're going to be basically extending um what it was uh so this blog post here

23:51

um basically we were getting uh tool call errors from some models from uh mainly from like MCP server so like tools that you didn't write and uh we found that it was mostly related to the schema being passed in so in a nutshell there was uh I have it somewhere here there was like three different ways that an LLM

24:16

can take the the properties so either they would explicitly throw an error so like if you look at this this example here we found that uh OpenAI reasoning models didn't support this field here so when you tried to pass in a string like property with a constraint to say that it has to be a URI it would throw an error thing like oh we don't support that and so kind of

24:48

sucks if you want to use that model and you're using an MCP server with tools that you didn't write you can't exactly just like change the schema um uh that easily and so another thing that it could do is that it could just accept this field but just ignore it so just pass a string that isn't the URL and then it would fail like uh like schema validation or

25:14

like the third one that accepts it does the right thing uh so whenever we came across a model so we had this like list of different um properties that we were testing and constraints on those properties and we tested it across a bunch of different models and whenever the model didn't support it we would essentially do something like this like add the constraint as a as a JSON object

25:42

onto the description and append it to like whatever description you had for that property already and so we found that this actually worked really well for all these uh models that either ignored the property or like the the constraint or threw an error so we have some some results over here so the I

26:06

almost called this green i know my colors this is orange uh this orange bar was the errors before blue bar errors after and so you could see it improved it pretty well and so this kind of leads into the next part um so this was kind of built directly into uh the Mastra framework so we saw uh a

26:29

post I don't know maybe this is a good good time for you to bring this up Tyler yeah I've got it i can uh can share can you Oh there we go okay so um this is a a blog post from Osmosis so they trained this model called Osmosis-Structure-0.6B and its whole purpose is to improve the quality of structured outputs so this is slightly different I guess than

27:04

than what we're doing so the problem that they saw is when you do a single generation with like Claude or something like that and you're trying to get it to do a math problem and you ask for a JSON output since it can't print a bunch of like tokens you know as text and then kind of look at those tokens as it's generating more the quality of the response goes down

27:26

significantly right it's just outputting JSON um so their whole thing is actually they get Claude to solve the math problem and then they pass the text from that into this other model that they trained and then that will convert the text into the structured output um so what this made us realize though like this is not really the same thing right so in Mastra you have agents uh where you

27:53

can just do multiple steps and then as the last step you can do some structured output so this is not really a problem but it did make us realize oh like our code for schemas in in tools could be extended to structured outputs to also improve the reliability of uh following the schema for the outputs right i guess even even taking a step back we're like

28:17

is this a problem that also exists in that context like if you use uh structured outputs with agents do they throw errors in the same way or have like the same behavior with schemas that they do with tool calling and so we looked into it turns out it errored in the exact same way that we were expecting so like when a model didn't uh accept the property with

28:47

a tool call it was the same thing with the structured output it wouldn't accept the property mhm it makes sense because the input schema for a tool is just an output schema for an agent so it's really the same the same problem and I guess that leads into what we're going to work on now which is we're taking that schema compat code that

29:09

Daniel made and we're going to apply it to outputs yeah and so another thing uh we haven't released it yet uh but we just merged it in is that all of the the schema transformation code used to exist like within the Mastra framework but we pulled it out into its own package because we want to open source the compatibility code so you could basically run this compatibility code

29:35

anywhere like if you're just using AI SDK directly it's not tied to Mastra at all and so we wanted to pull it out there to make it easier to use then also for our internal use we could reuse it in multiple places easier as well and so yeah we'll try to do that right now uh use it in structured outputs for agents and

30:00

workflows see how far we get you can watch us struggle uh one thing we did also talk about doing later maybe this is a different stream or another day but we're thinking about taking the uh the data set from the blog post that I was just sharing and running it in Mastra and comparing our results because maybe should we should we do that first because Yeah well I think we need we

30:31

need it to work first right um yeah I guess before and after yeah okay I guess uh do you want me to drive or do you want to Yeah go for it uh you might have a better setup going already for it i think I'm checked out on the the branch where everything's uh installed as AI SDK v5 right now so it might take

31:06

a while to reinstall everything okay for every uh size I make this bigger I become uh 15% less efficient so I'm just going to throw that out there um okay oh I feel like I'm using my parents' computer everything's so big okay um let's see let's see where do we want to start so let's look at uh so we have this schema compatibility package let's see how it's being used in

31:50

so wait yeah i got to say one thing at the bottom of your VS Code Daniel "Sloo Gatsby takehome number one needs reviewers" how long has that been sitting there for years i don't know how to get rid of it it's been sitting there for so long like I don't know yeah that must be like three or four years ago that's hilarious yeah there's also I I noticed like something else

32:14

Gatsby related that is somehow on my computer like um I think it's also related to GitHub like my my user is like still Gatsby or something i don't know but I I remember coming across it the other day and I was like what the Gatsby stuff lingering around yeah so I guess I did okay on the takehome because I got the job even though nobody reviewed it

32:45

uh sorry what were you about to ask Yujohn i was I was actually going to ask you like what are the first things that we are going to do like a general overview of the I guess of what this making moves session's about that's a good call maybe let's plan it out i'll just uh let's go in here let's just do like

33:09

a Okay so what we want to do uh let's look into uh the schema compat package and understand how to use it to um let's look at our test suite and see how we can test this easily um and feel free to jump in at any point if what I'm saying doesn't make sense or if you're like you're missing something there uh yeah what does test this easily mean like uh test I guess I meant like without having to

34:00

like build like a a Mastra project that I can just like test it in in like a script this is for agents in the test suite yeah in the test suite and then let's look into uh the agent class and understand how structured outputs are being used and how we handle the schema there and then write some tests and five uh do the same for

35:02

workflows so I think with workflows we might not need to do anything i think it's just like if you're using an agent in a workflow or are you saying because can't because workflows you provide it a schema as well but I think that your workflow code like you write the code to make it follow the schema it's like if you're using an agent in the workflow then I think it's already going

35:26

to use what we add to agents so I think we don't have to do anything what happens if you use a tool as a step like um would do you have to do anything there or it should Oh I don't think so i think I think it'll be okay i mean we should we should verify that though yeah we can verify yeah because I I

35:52

think it might only be for MCP and agent tools i'm not sure if it's for like just a tool that you write on your own but we can we can always Yeah we can make sure okay cool sounds like a good plan all right so let's go schema compat package okay we have this apply compat layer we pass it the schema which is an AI SDK schema or a Zod schema we pass it these

36:36

compatibility layers whatever we want to apply to it and then we set a mode so we have all these different compatibility layers um and essentially they have a a JSON schema target and then this function here which tells you whether or not it should be applied and then this is kind of like the whole meat of it is just processing this Zod type so any any types that you need

37:06

to change um uh apply these these handlers and then it comes out of the box with all these uh helpers so that you don't have to figure out all the intricacies of Zod nice cool um so then we have our plan let's look at our test suite so I believe that is builder yeah in here so we have we basically have all these models

37:51

that we're testing against uh maybe let's only do uh this one we know had a lot of issues let's just go with that to begin with cool then we create some some tools so So we already have a a test suite here from before is that right yeah yeah so this is this is a test suite that tests uh tool calling and so I added this which is

38:28

basically a copy of the so with the tool calling we go through this this schema and make a tool where the tool just has this as like a property and so just going through that and just making sure that for X model can it call all of these properties and not fail and that's essentially the test and so we we want to do something similar for

39:02

uh this this new thing except um instead of uh yeah instead of doing a call to the tool we want to figure out the or we want to here maybe it's easier if I so yeah so before we're making this call to agent.generate tool choice required max steps one the agent has this test tool so for this new case uh we want to create the agent and its output should be the schema of the

39:46

test tool that makes sense so we're just we're using this the same tools from the other test suite but we're just using them for their input schema as the output schema yeah exactly so we we've already done like the majority of this work so let's just reuse the stuff we already have cool that makes sense and so let's see

40:07

if we run this test now bump it so is it uh is it going to are you expecting it to fail or is it is some of the work already done um I think I'm expecting it to fail i didn't do anything yet so I really hope I change the name okay so let's just So right now it's Oh it's actually not running the new tests at all this is running the old

40:51

tests so that's just going to pass okay let's let's skip the old tests so we want to run the output schema compatibility and I'm going to skip the input schema compatibility let's just see what happens there we go this is this is the error that I was showing before in the context of uh the blog post the invalid

41:29

schema error for a string property URI is not a valid format and so we're seeing all these errors so this is perfect this is exactly what we want tons of errors we didn't we didn't like change anything so we expect to see some errors and these are the same errors that we originally saw with tool calling but now it's for structured

41:52

outputs yeah okay so now let's go to let's look at the agent class so or agent maybe I'll just ask um or output there should be a should be in the generate function eh Um I think it might be in a a couple different places we also have um there's an internal LLM class and I wonder maybe we should put it inside of that so we don't have to put it in the

42:50

agent because then I think it's like everywhere that it gets called um it'll have it um question in the chat uh is there some plans to add A2A docs on the site uh yes i believe that's being worked on right now uh so you should see sometime next week that we'll have those docs ready sorry what were you saying Tyler um in this file I think if you just like

43:26

command F look for the word output you'll probably see it so oh there's 51 matches oh maybe look for specifically output so we got the generate function here yeah there you go this is This is the one you're talking about from the generate yeah so that should be the schema what's the type of that if you

43:51

uh Yeah there you go output type schema for structured output so you're saying like right here because this is going to be used in multiple places um so so what I'm saying is um we have a L llm class it might make more sense to go in there because if you scroll down farther in here um you can see there's

44:15

llm.c_ext mhm so the actual call to the LLM is going to happen inside of that method oh I see so this this get LLM and then we have Mastra LLM base yeah so there should be like a generate object generate uh or I think stream object um those are probably the ones maybe possibly text as well oh you're in the base class you want to

44:48

go to uh whatever extends it oh here and I was going to say for those who just joined what we're doing here is uh we recently uh we will be releasing a new package uh probably next Tuesday which uh has a way to transform um let me see if I understand this correctly Daniel uh transform the uh

45:29

response from the LLM to the MCP tool uh in a compatible format okay uh in order to reduce tool calling errors um and we're just applying this package to agents now is that right yep yeah to reduce tool calling errors so it's just being I guess generalized for any schema that an agent interacts with so we did ship it initially and it was just for

45:55

tools but now it's going to be for schemas in general so what I'm thinking because there seems to be multiple places uh where this output would be used is I'm going to write like a helper so that we can just wrap the output in it all right let's let's vibe code this i don't type anything by hand anymore i'm too good for that okay
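The idea being wired up here, stripping a JSON Schema constraint a model rejects and folding it into the property description, then applying that pass to output schemas as well as tool inputs, can be sketched roughly like this (the interface and function names are illustrative, not the real package API):

```typescript
// Illustrative sketch of a schema-compat pass (not the real Mastra API):
// each layer decides whether it applies to a model and rewrites one
// JSON Schema property into a form that model accepts.
type JsonSchemaProperty = {
  type: string;
  description?: string;
  format?: string;
  [key: string]: unknown;
};

interface CompatLayer {
  shouldApply(modelId: string): boolean;
  transform(prop: JsonSchemaProperty): JsonSchemaProperty;
}

// The blog-post trick: drop `format` (which some models reject outright)
// and append it to the description so the model still sees the constraint.
const formatToDescription: CompatLayer = {
  // Hypothetical model check; the real package targets specific providers
  shouldApply: (modelId) => modelId.startsWith("openai/o"),
  transform: (prop) => {
    if (!prop.format) return prop;
    const { format, ...rest } = prop;
    const hint = JSON.stringify({ format });
    return {
      ...rest,
      description: [prop.description, `Constraint: ${hint}`]
        .filter(Boolean)
        .join(" "),
    };
  },
};

// Wrap any schema's properties, whether it's a tool input schema or an
// agent/workflow output schema, which is the generalization discussed here.
function applyCompat(
  properties: Record<string, JsonSchemaProperty>,
  layers: CompatLayer[],
  modelId: string,
): Record<string, JsonSchemaProperty> {
  const active = layers.filter((l) => l.shouldApply(modelId));
  return Object.fromEntries(
    Object.entries(properties).map(([name, prop]) => [
      name,
      active.reduce((p, l) => l.transform(p), prop),
    ]),
  );
}

const out = applyCompat(
  { homepage: { type: "string", format: "uri", description: "Homepage URL" } },
  [formatToDescription],
  "openai/o3-mini",
);
console.log(out.homepage);
```

Because a tool's input schema and an agent's structured-output schema are the same kind of object, one wrapper like this can sit in front of both call sites.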

46:26

Shane says "Should this eventually just be in AISDK?" It could be so so that so I think initially before we did this work we saw in their repo there were some like uh GitHub discussions and so people were kind of asking for something similar and their response was kind of um maybe it's changed now but it was like we don't want to necessarily do

46:56

magic in the AI SDK and that's kind of like an issue with the model itself on how it handles this and so we're a framework so when something goes wrong with the framework it kind of seems like it's an issue with like Mastra so if you use an MCP uh client pull in an MCP server get

47:20

the tools off of it and you're like hey why didn't this work like it works elsewhere um but like you could be using a different model and things like that so it kind of seems like it should be fixed in the framework but like um we're gonna like we said open source this package so you could really use it with AI SDK if you want

47:45

yeah you get back so when you call the tool compatibility um like helper you get the option to use a JSON schema or an AI SDK schema right Daniel so you'll be able to use it with AI SDK directly but if you're using Mastra it's going to happen for you automatically is it in here yes and then let's find where it is being

48:23

used i'm just loading up on some context uh helper function in uh what's this class called i think it's um LLM no that's the wrong one where'd it go this one here yeah in the MastraLLM class that will take an output schema and apply the schema compat utility I don't know why this

49:36

stuff keeps popping up anywhere the output is used wrap the output in this helper use all the available schema compat layers I've attached an example of how it is used as well all right let's see what it does now we just twiddle our thumbs yeah get up and stretch make yourself a coffee all right all right i think uh when I was on here with

50:40

Shane if something worked on the first try you'd take a drink did you guys take any drinks or did anything work on the first try actually I think we were trying to do something with MCP and it worked for the most part yeah I'd say we got things working on the first try like 50/50 maybe you guys got a little

51:11

bit tipsy but not drunk uh mostly hydrated oh okay just some water i mean I guess it is Friday yeah and this is the uh the Mastra happy hour that's true we should rename it that instead of AI agents hour oh yeah i mean Shane's not here we can rename it however we want there's nothing he can do

51:55

okay it's done thinking starting to move to some action i'm surprised how slow it is actually i wonder if they because Gemini is like very fast i wonder if they're uh throttling it a little bit it usually is faster oh it's got to be before AI okay i mean this looks this looks pretty good maybe experimental output is that

52:45

handled wait so I think the difference so with output it'll just output that schema directly with experimental output I believe it can do multiple steps and then on the last step it does the output as the schema okay but the type of it is still the same yeah i think it should still just be a schema oh we probably don't want this change though

53:16

i think we want something like this so doesn't our schema compat accept uh JSON schemas as well oh is this a So it looks like it's trying to determine if it should uh pass in or like convert a JSON schema or just use a Zod schema directly oh okay yeah no you're right oh I think I undid some things

53:47

actually Gemini was right here you gotta apologize hey let's go actually could you go over that experimental output again uh Tyler like what do you mean by multiple steps like the LLM will um keep you know recursively like talk to itself until it gets the final format or so when you hit an LLM API you're just generating a single step so it's

54:20

like a single text generation or right you know or like uh schema output right it's just one step but when you use the AI SDK or Mastra or like an agentic framework it'll actually run multiple steps and it kind of runs it in a loop for you um and then right now there's a max steps setting that has a default i think it's like maybe default is three or five or something

54:47

it'll never exceed that many steps but it'll keep calling the agent until the agent is either finished or it reaches the maximum steps and that allows it to do things like tool calls um or multi-step um uh like inference and then on the very last step it outputs the schema does that make sense yep that makes a lot of sense because um there's

55:16

times that it may when the uh when it reaches max steps what happens when it reaches max steps will it return a response or just it'll return it it just won't do any more steps after that okay so all like as it's going through the steps it's actually streaming in all the responses anyway and then it'll just stop streaming once it gets to the
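The loop Tyler describes — generate a step, run tool calls, repeat until the model finishes or the step budget runs out — can be sketched as a toy version. This is only an illustration of the idea, not the actual AI SDK or Mastra internals:

```typescript
// Toy agent loop: call the model step by step until it says it's done or
// maxSteps is reached. `callModel` stands in for a real LLM API call.
type Step = { finishReason: "tool-calls" | "stop"; text: string };

function runAgent(
  callModel: (history: Step[]) => Step,
  maxSteps = 5,
): Step[] {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = callModel(steps);
    steps.push(step);
    // Stop early when the model finishes; otherwise keep looping
    // (e.g. after a tool call) up to the step budget.
    if (step.finishReason === "stop") break;
  }
  return steps;
}
```

Hitting the cap just means the loop returns whatever steps it has so far, which matches the behavior discussed here: nothing errors, there are simply no further steps.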

55:38

maximum okay so this looks like it made a lot more changes than it should have so I did notice it said it was going to update any place where the where a schema is being used because I think you just asked it to add the helper and it it was like I'm going to add the helper and I'm also going to implement it everywhere i think I'm deeply sorry for the mess

56:11

interesting oh no the previous ones have been acceptable i feel bad for it now okay you know what i'm going to accept the helper and nothing else because it's kind of a mess uh I will accept this though i wonder if it got messed up from kind of like doing edits while it was working or something like

56:41

that okay so we have this apply schema compat let's just try it on what we're using in our test suite so that would be generate object right i think so um and then I think for us that's still the I believe that's still the generate method it's just it Oh no it's text object okay i'm thinking of the agent class okay so we

57:20

have Yeah so let's do Oh there you go and I don't think we need this yeah I don't think so we just need structured output and then get rid of that as any as well maybe hopefully oh maybe not i don't know why that was there is not assignable to type no schema that's interesting i think schema

57:53

here can be a um it's like a union and I believe you can pass a string or a schema so we probably just have something uh messed up in our types somewhere oh then we also oh output is uh oh I see it should be like this oh object so we maybe do need that check that it had there because it's conditionally setting it to array as

58:26

well oh I see so maybe let's keep output but we don't need to care about schema uh or maybe well I think you can change the references to schema just to structured output like what you have on line 242 oh like just set it here yeah that works and then I don't need to set the schema oh okay maybe yeah I don't think I actually

59:09

could just do what it's complaining about uh so you probably need to do as const on the object and array so like object and then just the words as const like the keywords yeah oh I don't need to it's because uh you can't I think the types it has to be okay that makes more sense what do you got in some problem though and then maybe you have to do it again on 241 or
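The `as const` suggestion here comes down to TypeScript's literal widening: without it, `"object"` is inferred as `string`, which a union type like `'object' | 'array'` won't accept. A minimal standalone illustration (the `OutputMode` union is made up to mirror the kind of option type being discussed):

```typescript
// Why `as const` helps: without it, TypeScript widens "object" to `string`,
// which is not assignable to a string-literal union the API expects.
type OutputMode = "object" | "array";

const widened = { output: "object" };            // inferred as { output: string }
const narrowed = { output: "object" as const };  // inferred as { output: "object" }

function setMode(mode: OutputMode): OutputMode {
  return mode;
}

// setMode(widened.output);   // type error: string is not assignable to OutputMode
const ok = setMode(narrowed.output); // fine: "object" is a member of the union
```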

1:00:08

I mean for now we could just do as any to see if it works and then come back and figure out what's going on with the schema here you can actually do it in line right on 249 I believe just Yeah just that's what it was before i think whoever wrote this before just didn't want to figure it out they ran into the exact

1:00:26

same thing uh okay so now that you saw me struggle for a bit let's uh so we got to build the core package let's do this so if the tests pass does that count as it working on the first try i think it does yeah i don't think I have a drink though oh wait i got one right here uh okay hopefully we did it

1:01:04

on the right method cuz you know I'm getting thirsty motivation you're only allowed to drink if you get it right the first time yeah earn this water yeah okay okay well this is interesting wait what did we change here that pipe returned by uh Oh we added the class to like the schema class to the one that's extending

1:01:44

it not to the base class i think that's why so we might just move it to the base i have a branch where I'm working on AI SDK v5 support and I just deleted this base class because we don't actually really need it for the LLM class because there's only but aren't we returning the exact same what

1:02:12

would this unknown oh maybe because of the type change that we made we do need to do what you were saying oh this says output of any no that's not it wait is it this one here like if I get rid of that no I guess same thing i guess not maybe this was here before because this doesn't seem related to what we changed yeah we probably did

1:02:54

change something small that caused this but I mean I think right now we just want to see the tests pass so what I would do is uh uncomment that method and just put a like ts-ignore let's just see if the tests pass then we could figure out what's going on with the uh with the class after we didn't change oh

1:03:11

I guess maybe this might be a different type the schema maybe i mean the return type here though is from generate object yeah which is not Yeah with the actual PR we'll really fix it but I think it's a little bit boring to debug type issues yeah yeah you know what that

1:03:36

doesn't count uh for a first try that's like you just want to be allowed to drink some water i just want some water why won't you let me uh okay same thing did you build it though oh is this you building it or is this the test no this is me building it oh okay all right i think this is

1:04:06

officially not the first try at this point is it okay oh this is the Okay this is like the git diff yeah um yeah why is this oops cuz we didn't change anything other than this mhm and I guess this So oh hey our type doesn't have the schema T we don't have this generic so we could do something

1:04:55

like maybe now it's happy our process schema does it actually return the schema object though or does it return um Yeah oh it does okay yeah so I guess we need to make these generic yeah okay stay thirsty you know what i'm just going to drink off screen hey it passed wait did it can't see the Oh no it's running all of my Oh it built

1:05:40

and the tests are passing i think at this point it was failing before the tests okay so technically this is the first run of the tests though so that might count oh okay if we get all pass I'm chugging this whole drink on live stream cuz we're extreme here or Oh hell yeah nice so not bad maybe we can give an overview of what we just accomplished

1:06:13

here it's a Yeah so um if you're just tuning in hello if you've been here for a while I'm sorry um what we are doing here uh we're uh basically applying I'm not going to go over everything again but we're applying this uh schema patch for um LLM interaction and so we have this schema patch already for tool calls uh like the

1:06:45

the input schema for tool calling so it makes uh tool calling uh succeed more often when the schema can be handled properly and so we're like okay hey why don't we do this with output schemas now and so uh we want to handle this for u uh all calls to the LLM basically so not just like tool calls um any any interaction that involves like a schema

1:07:18

that you have to pass to an LLM because they don't necessarily handle the schema properly so we want to apply this patch more generally and so we applied it in our tests here um where is it we're running we have this generate function so we just went into where inside of this function the output schema

1:07:47

gets passed to uh the LLM and so we did that schema transformation and so now all of our tests are passing which means that uh the LLM is handling the schema properly and like 20 minutes ago was it like 10% of these tests were failing yeah I think like six out of like 30 of them or something like that and the tests

1:08:12

loop through a bunch of different types of schemas yeah so like these are all the schemas that we're testing against so a bunch of different like string constraints number constraints array types object types and just like a bunch of other ones nice so now we're getting 100% success for o3-mini why don't we turn on some of the other models and see what we get but first did you
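The test suite being described — a matrix of schema shapes run against each model, tallying failures — could be sketched like this. The case list and the `tryGenerate` stand-in are made up for illustration; the real suite calls actual models:

```typescript
// Sketch of a schema test matrix: a fixed set of schema cases run against
// each model, counting failures per model. `tryGenerate` stands in for a
// real structured-output call that returns whether generation succeeded.
const schemaCases = [
  { name: "string minLength", schema: { type: "string", minLength: 3 } },
  { name: "number multipleOf", schema: { type: "number", multipleOf: 5 } },
  { name: "array of strings", schema: { type: "array", items: { type: "string" } } },
];

function runMatrix(
  models: string[],
  tryGenerate: (model: string, schema: object) => boolean,
): Record<string, number> {
  const failures: Record<string, number> = {};
  for (const model of models) {
    failures[model] = 0;
    for (const c of schemaCases) {
      if (!tryGenerate(model, c.schema)) failures[model]++;
    }
  }
  return failures;
}
```

Comparing the per-model failure counts with and without the compat layer applied is exactly the before/after comparison done later in the stream.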

1:08:36

take a drink because I think that counts as This one's for you Shane yeah i'm going to pour one out on my keyboard all right uh what was I doing oh yeah you wanted me to uh I think in the test file yeah cuz we just commented out everything except for o3-mini we got like 10 other models yeah so

1:09:12

let's run a bunch of these let's see what happens this will probably take like a minute yeah Shane claims he's not dead but I haven't seen him all week so uh TBD needs to be verified impostor there if anybody's uh in Sioux Falls please check on Shane make sure he's okay all right we got a failure so one I mean that's pretty good we got 86 passing so far yeah oh

1:09:55

just a bad request all right let's uh Okay but we're still doing pretty good we got four failed 16 passed so far still running yeah and to be honest like tuples um I don't know if we have any tuple fans here but uh tuples yeah I don't know if you really need them for like tools and stuff and output schemas with agents but I'm sure

1:10:30

I'm sure we got some people who want to use them but it looks like this is just one model right this is just uh 3.5 Haiku which is not really it's not a very good model anyway right it's a smaller cheaper model so 12.1 mini this seems like an actual error that we don't want yep we're doing pretty good though up to 286 past 5 i think uh you know what we should have

1:11:04

done i think when we ran the tests the first time we just ran it with o3-mini we should uh temporarily put it back run the tests i don't know if you can do that easily but see how many failures we get with all of them yeah I'll do that after this um okay we're almost done like 30 tests 20 25 honestly this is a lot better than I

1:11:34

thought it would be because we didn't really do anything all right we didn't tweak anything about it we just added the existing logic right yeah it's a pretty good result I'd say all right it's slowing down it's getting tired okay yeah this is sweet i feel like out of this the only

1:12:02

error that really concerns me is this one like number multiple of because that's like a pretty basic one and the error that it had was kind of like it didn't handle the schema properly i was like okay that's something we could handle but in terms of like tuples and uh union objects uh what I've seen with those is

1:12:25

like often models just don't support it and it's hard to get it to support it with just a description because it's kind of like the type of field you can't really get it to pass you that information without having that type of field what we are seeing though is that actually all these models are doing it correctly it's just

1:12:48

um what is it iq iq there's a set at the top too 3.5 sonnet failed on one of them oh yeah tuples as well yeah even with tool calls I think most of them didn't handle tuples okay uh but this is sweet um Oh yeah let's try to revert and see what happens so if I go to You know what i think I could just do this oh yeah just don't uh Okay yeah

1:13:27

wait is this I think it doesn't have the type now it was inferring it oh okay i'll just do that really you might have to just make an empty array literal right there on line 93 oh just do this yeah all right if you have to build though you had a line that's going to fail the build up above because I think there's

1:13:52

this variable uh I don't think that'll fail the build oh it won't okay i don't think so maybe that's just in CI okay nice yeah because in CI we run like lint and stuff like that so there's a lot of them passing right now is this uh I feel like I was expecting these to be failing like what we saw the last time we ran them yeah we are still calling though

1:14:33

like the apply schema compat function though is it possible that it could be adding something uh I don't think so let's just see how this does and then um Okay we're seeing Okay okay okay thanks for I guess not making us look like fools These are They're doing pretty good actually i was expecting a lot more to be failing oh honestly because the

1:15:11

test starts with the Claude ones and the Claude ones usually succeed oh okay now we're in Gemini maybe you just spoke too soon Tyler maybe so we had what was it five failures with the fix applied now we're up to 27 failures with it disabled we didn't get any errors with Gemini right in the first one so right like with the

1:15:41

with the fix enabled right pretty good i'd say it works mhm yeah i'm surprised we're not getting more failures but you know what i guess that's good okay it's still climbing mhm okay you know what i think that's still a success yeah yep because we went from five and you know what those five are mostly tuples

1:16:35

so if you're just joining the stream now you know I'm not a big fan of tuples i'm not a tuple head so okay so 40 41 failed that's a pretty big difference 40 41 failures down to five i bet you we could get those five to work maybe not right now but I bet you we could do it yeah there's a little bit of elbow grease

1:17:06

yeah I mean grease we got to just grease these agents cool well I think that was a success so we got a little bit over i mean it's fun we can keep going if you want or Yeah this might be a good stopping point yeah I'm down to keep going for a little bit um because I'm kind of curious to

1:17:36

keep going now um we got some positive reinforcement and that's what keeps me going uh if this didn't work I'd probably be like "No let's just jump off now and walk around." Okay so I guess we can try this for I don't need to keep looking at this diff we can try this for other so one thing I remember you were

1:18:15

saying the experimental output one we're having some issues with that should we give that a shot all right yeah let's try it let's go back to our old friend o3-mini do you use o3-mini as the testing one because it fails a lot is that Yeah that is I know that it's going to fail for like some at least

1:18:41

uh yeah no shade to uh OpenAI um Sam if you're watching uh hope we're cool other Sam yeah other Sam that's OpenAI Sam i don't know you're using Cursor not Windsurf here so Oh true oh somebody just left the stream it was probably him okay what was I doing enough with the talking

1:19:20

uh yeah so experimental output here right yep and so I think let's see if this even works okay oh I mean it's working it's failing a lot um and you built core or whatever oh I think this is actually what was happening before it was failing on everything i didn't look too much into it but I think it was just like I get this no matter what maybe we need

1:20:00

more than one max step like make it two or something let's do three i think I need to build core again too so what are we testing here again so we're testing with experimental output um are there any changes that we have to make in order for this to work or should it already just work with our changes uh so we only applied our changes

1:20:40

to output and not experimental output um yeah I think I think it's just not working in general oh yeah could not parse the response no object generated so I guess that's why it's experimental let me see if I can find anything on that in the background here okay yeah because we could look at what response we're

1:21:25

getting wait I think it's just erroring probably have to res Oh apparently you can get that error um when none of the responses had the right object shape so it could actually just be from o3-mini not Oh like with this 400 uh oh no I was looking for the other error that you saw i don't know what this is okay yeah because we're getting a 400 with every request

1:22:05

invalid schema okay no yeah but that's kind of what we expected right maybe it's just that the error handling is a bit different for some reason okay well I mean it's weird that everything fails though because you'd expect it should be able to handle at least some so that's why I'm kind of suspicious that so the response though I

1:22:28

think it also I don't think it's object it's like experimental something oh no it's uh it's not logging this out it's going here because this request is erroring i wonder does tool choice affect it at all oh like if you just remove this yeah it's like commented out or something yeah no no it doesn't oh wait oh okay okay so that must have been it

1:23:12

then maybe it's um we can't have it as required because maybe that means every step it's required i don't know we don't really need that anyway for this test because we're not using tools for the test so yeah oh actually that kind of makes sense right if we're trying to do an experimental if we're trying to get it to output text and we're not giving it

1:23:35

any tools and we're saying that a tool choice is required there's no way that it could succeed right well we are giving it a tool uh we're giving it Oh okay h okay i don't know then yeah like is it hallucinating those results or it's not getting those results from the tool it's getting the result from this here so we're basically just

1:24:02

saying like hey generate a response that will satisfy this yeah oh I guess the instructions for this agent don't make sense oh because I just copied it from the other one it did pretty good in that first uh test we ran then considering the system instruction was wrong i mean the errors here are exactly what we're seeing with the

1:24:30

output okay nice oh we got a question from Ysef yeah what's your question meanwhile Shane also uh made this comment tool choice required always seems to use max steps and this is something I was talking about with Daniel and uh Heather like right before this call i've noticed that as well um this is from the AI SDK where if you just put required it

1:25:13

just takes forever to generate a response oh yeah you were saying that that's right did you see that for some like a specific model or just generally um actually I was trying to debug a Discord user's problem and was reproducing the error and I was like that's just something I noticed right where
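Shane's observation — that `toolChoice: "required"` tends to run all the way to max steps — makes sense if you think of the loop schematically: when every step is forced to be a tool call, there is never a terminating plain-text step. A toy illustration of that dynamic (not the AI SDK's real logic):

```typescript
// Toy illustration: if every step must be a tool call ("required"),
// the agent loop only ends when it exhausts maxSteps; with "auto" the
// model can answer in plain text and finish early.
function runSteps(toolChoice: "auto" | "required", maxSteps: number): number {
  let steps = 0;
  for (let i = 0; i < maxSteps; i++) {
    steps++;
    // With "auto" the model can emit a final text answer after its first
    // tool call; with "required" every step is a tool call, so no step
    // ever counts as a finishing answer.
    const finished = toolChoice === "auto" && i >= 1;
    if (finished) break;
  }
  return steps;
}
```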

1:25:39

is experimental output or do we maybe use a different function to call if we get experimental yeah I think it's just the text generation yeah oh we get stream i think the tests are using generate you want to find the generate oh then we have text yeah oh it's just text okay so same thing

1:26:09

here uh I don't think we need any of this we could just instead it's not doing the the output type like in the other one oh this has to be wait is this expecting a this is expecting a Oh schema can be schema okay cool so I think we can do the same thing like just uh uh nice oh but we I think we don't want this

1:26:48

right and I think we want to just have this directly because it can be undefined so I think you need like the ternary on 179 you need to check if experimental output instead of Oh I see yeah thanks good catch all right you ready to take a drink i still got a little bit of drink left so we can have up to five failures out of the whatever 700

1:27:31

requests to consider it a success at least as much as the other one was yeah if I mean this is just o3-mini so this should be 100% okay yeah if we Oh no it's not working at all did you rebuild it yeah oh wait i think something's Oh this is something else okay yeah this is Did I Wait why did that maybe you do need

1:28:07

more than one max step maybe with experimental I didn't actually need to build again oh well that's 10 seconds I'll never get back it only takes 10 seconds for some reason on my machine it's like 20 to 30 oh 16 computer's much worse than yours for some reason oh yeah I was pairing with uh Okay I guess that was it we just needed one more step for Okay I was pairing with Yujohn the other

1:28:49

day and just like watching him build core i was like this is painful it's hurting me looking pretty good yeah I wouldn't say this is the first time but maybe we could just take a drink anyway cuz I'm thirsty okay yeah look at that you know what i think that because this max steps thing I accidentally put it down to one so that's not really on you you

1:29:20

still deserve that drink i'll go thirsty everyone except for Daniel gets a drink yeah I'll take one for the team uh okay let's go oops what did I click on now let's try for all the models take a drink for every test that succeeds that's a lot of drinks yeah that's like what is that 700 or 695

1:29:54

we're allowed five failures by the way anyone who's watching can feel free to ask some questions about Mastra and what we're doing um yeah we're still waiting for that tool question yeah it's killing me i'm just waiting like what could it be oh I wonder okay it looks like all the anthropic models are failing because they're just like thinking too much they're not we need

1:30:26

more steps just more steps for the overthinkers yeah oh four there you go that's definitely me if I was a model I'd be the overthinking one oh Yujohn you are a model oh we saw your wedding photos we're getting the same thing just for anthropic models so yeah maybe they're just not good with experimental output

1:30:59

maybe because yeah they're just responding a bunch of Yeah you can see that it's just like thinking about it weren't we seeing this with like DeepSeek when we were trying something out before where it would just keep thinking and you could see that it has the answer but it just didn't respond with it yeah i mean for this we could I mean

1:31:26

we could just comment them out for now or we could try just increasing it even more um finish it's weird though because we got finish reason stop it's not like max steps i think you get a finish reason of whatever max steps or something like that if it actually hits it i believe I don't know no yeah it must be something else

1:31:59

i think it's actually just not doing it it's like based on the Yeah that's Yeah it understands what we're asking it understands the schema but it's just outputting it as a yeah like a single response is it still running are are the other ones doing yeah it's we're still on anthropic so I guess we'll see once we get to Gemini see who the true king is

1:32:48

i'm gonna go make sure we don't run out of uh OpenRouter credits oh yeah you uh you need Shane's credit card i think I already got it so I'm good uh here I'll read the numbers i have it here somewhere yeah just put it in the chat okay so we're still on Claude it was probably going way slower because I added so many max steps yeah go

1:33:42

like how many did you add 10 10 oh yeah you should just cancel this and comment out the anthropic ones see I'm just really curious if any of them work with this because I mean having an improvement on output i think that's a win but if this works too like that's awesome if it's just a couple models that can't do it no big deal but all

1:34:06

right we're still getting some failures same here oh yeah i wonder if we're doing something wrong mhm cuz yeah like we can see it is actually returning it it's just not is it or it's like Yeah maybe we don't quite understand how this experimental output works like this might be wrong somehow

1:34:42

does it have like a what was it like mode or something like that for the other one uh what do you mean there was like another key that we had um I don't know if it was like schema or what was it a key for what uh when you were calling generate object I think there were two keys oh there was like an output and you can set it to like array or object

1:35:11

i wonder if we have to do something like that let's take a look at the docs now to see if I can find anything next step any steps tree options doesn't look like it yeah I don't see any other execute oh um on the if you go to your test for a sec mhm and he's just climbing up onto my lap it's his uh dinner time so he's being extra annoying right

1:36:16

now uh so I go back to the tests yeah and then go to where we're actually making like the generate call mhm uh so on the response do we have anything else besides object i wonder if there's like another key or something uh there experimental output right there at the top yeah we probably need to check that instead oh like this should

1:36:50

be Yep i think the other one might be just I don't know if we're actually doing anything with that i think we're just checking to see if it doesn't fail oh it throws an error yeah if it didn't work uh okay all right i think it might be the same thing yeah oh shoot because there's Yeah because this is logging out like an error from the API call got it

1:37:32

um yeah i mean I guess this is partially satisfying or unsatisfying but maybe we should call it here because this is probably going to be not that interesting to keep going with unless you have any ideas right now i was just reading docs but I haven't found anything wait can you go to the uh model.ts file again for one

1:37:57

sec and then okay output object says that you can I don't know if it's a hallucination I'm just looking at uh Perplexity but it says there are different output strategies object default array enum no schema but I don't know if that's actually true that might just be for the other one right and I don't know we should have quit while we were

1:38:44

ahead i know i think we got a pretty good result though that was a huge improvement because how many failed like for the regular output without the for yeah for the regular output when we were just using output instead of experimental output um it was like 40 something of the tests failed and when we applied our

1:39:11

schema compatibility layer only five tests failed and most of them were tuples screw you tuples Um nice there's going to be a person named Tuple somewhere out there in the world just watching this like what did I do um Tuple if you're out there come on the stream sometime box it out yeah cool well this has been fun i

1:39:47

guess like we talked about some AI news a bunch like a new Gemini model uh new like Gemini 2.5 Pro that looks very promising um talked about Cursor's new uh new version 1.0 lots of cool stuff what else did we talk about they got 900 million yeah 900 million on 500 million revenue and there's some other stuff but I mean I think the programming

1:40:26

went pretty good like we didn't get the experimental output to work but we did get regular output to work so I think Yeah I think we're going to be able to ship that pretty soon here yeah so that's pretty exciting i bet you we're just like one step away from getting experimental output to work too

1:40:42

we just got to do a bit more debugging cool yeah anything to add Tim no okay well this was fun thanks everyone for watching um the three of us will have to stream again soon because it was actually fun streaming with you guys yeah let's do it again sometime we should do the same thing do some AI news and then do some programming yeah if Shane lets us he

1:41:13

uh he runs a pretty tight ship i'm sure we can convince him cool all right well uh enjoy your weekend you too see you