Back to all workshops

Building AI Voice Agents with Mastra and Roark

April 17, 2025 · 4:00 PM UTC · 1 hour

Voice agents are quickly becoming one of the most powerful and intuitive interfaces for AI. In this hands-on session, you’ll go beyond chatbots and learn how to build and monitor real voice-based AI agents using Mastra and Roark AI.

Whether you’re dreaming of an AI receptionist, a customer support rep, or a personal assistant you can talk to—this is your chance to build one from scratch.

Join members of the Mastra and Roark team to:

✅ Build a voice agent from the ground up using Mastra

✅ Connect it to real-time speech input and audio output

✅ Monitor and replay conversations with Roark

✅ Learn how to continuously improve agent performance and reliability

We’ll cover both the technical foundations and the tooling that makes iterating on voice agents fast and effective.

This event is beginner-friendly but best for developers and aspiring AI engineers.

Workshop Transcript

3:10

James, Daniel, how's it going? Good. How are you? I'm doing pretty

3:18

well. Yujohn, good to see you. Even though I just saw you, good to see you again. And for everyone else, we'll probably get started in

3:30

about five minutes. We usually like to give people time to get in. I've always thought I should just have some good waiting music lined up, you know, some elevator-type music just kind of going on in the background. Yeah, exactly. So people come in, they're like, "Okay, this thing's not started yet, but it's

3:53

going to start." So, I don't know. Oh, well, you have the guitar behind you. That might work. Yeah. Oh,

3:59

yeah. Should I just play it? Yeah.

4:34

That's much better. Yeah. Yeah. Yeah.

4:39

Gives it a little professional feel. We like to do things professionally around here. Good to see you all. Well, we are going

5:22

to get started in probably about two or so minutes. We usually like to give people a few minutes, as they're often coming from other meetings or whatnot. So, we'll give everyone a few more minutes before we actually kick it all off.

6:04

While we are waiting, if you want to drop in the chat where you're calling in from, that's always a good way to kick us off. So, as of right now, I am in Sioux Falls, South Dakota. But I kind of wish I was where James and Daniel are.

6:30

Yeah, we're out in Malta right now. Seems way nicer. I mean, I don't know that for sure, but I'm guessing. So, we got some people from the UK, Paris, France,

6:57

Vancouver, Austin, Texas. All right, we'll start in about 30 seconds. Yes, Abby, today's the first day we have elevator music, upping the production quality. And with that, the song just ended. So maybe that's the sign that we're going to get it

7:22

kicked off. All right. So, let's see here. I'm going to go ahead and share my screen and we are going to kick this thing off.

7:39

Thank you everyone for joining today. It's going to be a really fun one because we have a number of people that are going to show some demos and talk. So, we have Yujohn, Daniel, and James, and I'll introduce everybody here as we go through. I'm sure there'll be more people coming in, but definitely ask

7:58

questions along the way. We'll try to answer them. If we don't get to them all, we will have time at the end. And

8:04

we will also be sending out the recording. So, if you did have to leave early for some reason, it'll be a few hours after the event, but we will send it out and you should just get it in your email or whatever you used when you signed up for the event. All right. And with

8:18

that, let's go ahead and get started. So, I'll go ahead and share my screen. And so we'll start with kind of just going through a few slides talking a little bit about what voice agents are and then we'll spend most of the time in actual demos because that's the fun part. People want to see how it actually

8:39

works. But let's start with what the goal is of this workshop. We want to basically build, monitor, and test voice agents. And so we want to learn some of the tools and tricks for actually doing that.

8:51

And we're going to cover what voice agents are and what are some of the use cases people are using voice agents for today, learn a little bit more about how you can use voice within Mastra, and then how you can actually monitor and test those voice agents with Roark. Then we'll have some time for Q&A. So about me, a little

9:08

bit about me: I'm a co-founder of Mastra, the chief product officer. I was formerly at Gatsby and Netlify. I built a product called Audiofeed and I've been in open source software for quite a few years now, but not to date myself too much hopefully. Yujohn was previously at Netlify and he's kind of

9:29

our voice expert and one of our cloud experts on the Mastra team. So definitely reach out and connect with him, and these slides will be shared as well after the event. So, you'll get this in that follow-up email. And excited to also introduce some friends of ours from Roark. So, James, he was

9:47

previously at AngelList. He's been spending the last 10-plus years, the last decade of his life, in web and cloud development, and he's the CEO of Roark. And Daniel is the other co-founder and CTO of Roark. He was previously at a YC startup and has also been doing this for a decade. We're getting old. All right, so let's

10:11

talk a little bit about voice agents and why they're becoming more and more popular. They've been around for a little while, but a lot of things have started to come together that make it a really good time to start to build voice agents. There's this shift from just a text chatbot to more natural

10:31

voice. People want to get their questions answered quickly and directly, and typing something out on their phone through a little keyboard is sometimes not the best way to actually do that. And also with model latency going down and costs going down, some of this stuff starts to become more feasible where before, especially because voice

10:48

is so latency sensitive, it was really hard to do. It's becoming a lot easier now. So talking a little bit about what makes up a voice agent: the more traditional, typical flow is you have some kind of speech-to-text provider that takes audio and turns it into text. You have some kind of LLM that does the reasoning and actually generates a response. And

11:09

then you have a text-to-speech provider which then speaks the response back to you. So you kind of have this pipeline. The nice thing about this is you have three different pieces. It's pretty modular, but

11:21

each step in the process does increase the latency a little bit. That's why it's really important that the LLM responds with as little latency as possible, that you can do the transcription really quickly, and the text-to-speech really quickly, because any additional latency in voice starts to kind of break down on

11:36

the experience. There's also something that's a little bit newer: real-time voice-to-voice models. Essentially it's all just one API, one model, that does all three of those steps in one. And because of that, they can reduce the latency on the handoff

11:54

between those steps. OpenAI Realtime, for example, streams and transcribes words as they're produced. So it's actually creating the tokens and then streaming the tokens as they're produced. So you don't have to wait for

12:06

the whole response to happen before you can start streaming it in. So you can really reduce latency, but because it's kind of one black box, so to speak, it is still harder to debug. And because it is newer, it's a little less flexible, so you don't have quite as much control.
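The three-stage pipeline described above can be sketched as follows. Every provider call here is a local stub (a real agent would call hosted STT, LLM, and TTS services, e.g. Deepgram, OpenAI, and ElevenLabs), so this shows the wiring, not the models:

```typescript
// Classic three-stage voice pipeline: speech-to-text -> LLM -> text-to-speech.
// Each stage is a local stub standing in for a hosted provider.

type AudioChunk = Uint8Array;

// Stub STT: a real provider would transcribe streamed audio.
async function speechToText(_audio: AudioChunk): Promise<string> {
  return "why is the sky blue";
}

// Stub LLM: does the reasoning and produces the reply text.
async function reason(transcript: string): Promise<string> {
  return `You asked "${transcript}" - short answer: Rayleigh scattering.`;
}

// Stub TTS: a real provider would synthesize audio from the reply.
async function textToSpeech(reply: string): Promise<AudioChunk> {
  return new TextEncoder().encode(reply);
}

// One conversational turn. In production every await is a network hop,
// which is why each stage has to be as low-latency as possible.
async function handleTurn(audio: AudioChunk): Promise<AudioChunk> {
  const transcript = await speechToText(audio);
  const reply = await reason(transcript);
  return textToSpeech(reply);
}
```

The modularity is the draw: any of the three stubs can be swapped for a different provider without touching the other two, which is exactly the trade-off against the single-black-box real-time models.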

12:23

So we see that there are really three common architectures with customers that we've talked to. I've spent a lot of time talking with Roark and they've told me some of these same things, so I've picked up a lot of this from them as well. You have the traditional pattern of speech-to-text to

12:41

LLM to text-to-speech. That's definitely more mature because it's the more common way. You do have a little extra latency.

12:49

So that is kind of one of the big drawbacks there. Then you have voice-to-voice, which is the lower latency option with the ability to show more emotion in the speech, but it's still in the earlier stages and it's a little less flexible right now. Then you have some of those full-stack platforms which

13:06

are batteries included. I always like to equate these to the no-code website builders. They're pretty good and you can get really far with them, but oftentimes you get kind of locked in and you have limited customization. So

13:18

there's definitely some trade-offs there. As far as industries that are using real-time, or I guess I should probably just say what the use cases on this slide are: we've been seeing a lot of use cases in customer support, financial services, healthcare, and then training and coaching. So a lot

13:42

of back office, a lot of customer success and customer service types of use cases that we've seen quite a bit as we've been talking to customers that are using Mastra. And so, I don't know if anyone here has experienced it yet today. You can still

14:00

sometimes tell when you're not talking to a real human, but a lot of times it only handles part of the customer service, right? You can still get to the human a lot. I think most people have success when they can take out

14:12

certain tiers of support, kind of the easier parts, and then they still often end up going to that actual human in the loop to respond to the more complicated requests. Some general learnings that we've had as we've talked to customers building voice: it's best to start simple. Pick a

14:32

more narrow use case with a clear return on investment and a clear understanding of what kind of latency is acceptable. And then the other thing I would suggest is to continue to revisit the architecture as these real-time models get better. I really do think that more and more of

14:49

this is going to end up with those real-time models being the best solution. It's not quite there yet for all use cases, but it's probably something worth evaluating every quarter or every six months after you build out your voice agents. And then I would also encourage you to build evaluation and monitoring into the pipeline from the beginning. So you can really make sure that as you

15:10

scale, you have the right pieces in place to know that your voice agents aren't going off the rails, right? They're making the right decisions and they're actually solving problems for you. And before we jump into the demos, I'm going to do a quick introduction to what Mastra is and we'll talk a little bit about Roark as well. So I'll toss it over to James here in a second. But Mastra, if

15:29

you're not familiar, is an open source AI agent framework for TypeScript. We have a whole bunch of things out of the box. You have agents with tools, memory, tracing, and voice capabilities. We have agentic

15:40

workflows, which are more deterministic, with simple APIs and human in the loop. We have an eval framework built in. We have RAG built in. We have a local development playground which makes using it all very easy. The goals around

15:55

Mastra are to be opinionated but flexible. Meaning we have sensible defaults so you can get further faster, but we don't lock you in, so you can swap anything out that you need and make sure it really can grow with you. And I'll hand it over to you, James, to talk a little bit about Roark. Thanks, Shane. Yeah, so you can think of Roark as

16:13

a Datadog for voice AI. The idea is that we help voice AI developers like yourselves figure out if their agent is going to go off the rails or not. And we do this by providing tooling that allows you to monitor and evaluate the performance of your agent, both in production and while you're testing. We allow you to run simulations against

16:31

your latest voice AI changes and check for regressions before you deploy a change. Whether it's to the prompts, to tool calling, or testing out a different speech-to-text model, before you deploy your changes you can use Roark to tell you whether that's going to cause any regressions or not. And then we also

16:48

have some really powerful analytics and reporting baked in that allows you to analyze the behavior of your agent across all of the aggregate data that we collect across all of your calls. Awesome. With that, I'll hand it back off to Shane. Yeah. And with that, let's actually just jump into the demos.

17:05

That's what everyone's here for anyway. And so I'm going to hand it over to you, Yujohn. We're going to start by first showing how you can build voice agents in Mastra and then we'll also show how you can analyze and monitor those using Roark. So yeah, give me one second. I will stop sharing and I will

17:27

allow you to take over, Yujohn. All right. Thank you, Shane. So yeah, before we begin I'll talk a little bit

17:38

about what Mastra provides in these demos. As Shane mentioned, voice agents are becoming more popular, and here at Mastra, instead of creating a new or different framework for voice agents, we want to give you the ability to give voice to your agents instead. And we do this through a unified API

18:03

providing text-to-speech, speech-to-text, as well as speech-to-speech. So, I believe I'm sharing my screen here, and we have our voice workshop demo, a repo which will be available after this workshop, and the first thing we'll take a look at is the API itself. So here we have three different providers, all implementing the speak method, which is a method for text-to-speech, and we

18:36

have, so what we do is we import an instance of our Mastra voice provider, which is from our Mastra Deepgram voice package here. We also have one for OpenAI and one for the Realtime API. So how would you use this on an agent? We'll go take a look here. There's a couple different ways. For a single

18:59

provider that offers both speech-to-text and text-to-speech, what you can do is this: the agent has a voice property where you can assign an instance of the Mastra voice provider. So in this case we're using OpenAI. Now if you want to use a different provider for speech to

19:23

text and one for text-to-speech, what you can do is use a composite voice, which is imported from our Mastra core voice package. And here we are assigning Deepgram as our input for speech-to-text, and we are assigning OpenAI as our text-to-speech at the output property. Now when it comes to the Realtime

19:55

API, or real-time models that offer speech-to-speech capabilities, it's very similar to what we've seen before with a single provider, and here's an example with OpenAI Realtime voice. So with that being said, I'm going to go over the different building blocks of creating these agents, starting off with speech to

20:21

text. So here we have a very simple example where we are importing our OpenAI voice, taking an audio file, creating a stream from it, and passing it to our listen method on our voice instance. Here at Mastra, we want to personify our agents, which is why we stuck with the methods listen for speech-to-text and

20:46

speak for text-to-speech. So once we transcribe this audio file, we're just going to log it out. And yeah, let's see how it goes.
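The provider wiring just described, one voice for a whole agent or a composite that splits input and output, has roughly this shape. The method and class names mirror Mastra's voice API (speak, listen, a composite voice), but the providers below are offline stand-ins so the sketch runs by itself; real code would import the actual providers from the Mastra voice packages instead:

```typescript
// Sketch of a unified voice API: speak() is text-to-speech, listen() is
// speech-to-text. CompositeVoice pairs one provider for input with another
// for output. The two providers here are local stubs, not real SDK clients.

interface MastraVoice {
  speak(text: string): Promise<string>;   // returns "audio" (a stream in a real SDK)
  listen(audio: string): Promise<string>; // returns the transcript
}

class StubDeepgramVoice implements MastraVoice {
  async speak(text: string) { return `deepgram-audio:${text}`; }
  async listen(audio: string) { return audio.replace(/^[a-z]+-audio:/, ""); }
}

class StubOpenAIVoice implements MastraVoice {
  async speak(text: string) { return `openai-audio:${text}`; }
  async listen(audio: string) { return audio.replace(/^[a-z]+-audio:/, ""); }
}

// One provider handles speech-to-text (input), the other text-to-speech (output).
class CompositeVoice implements MastraVoice {
  constructor(private input: MastraVoice, private output: MastraVoice) {}
  listen(audio: string) { return this.input.listen(audio); }
  speak(text: string) { return this.output.speak(text); }
}

// Deepgram for transcription in, OpenAI for synthesis out.
const voice = new CompositeVoice(new StubDeepgramVoice(), new StubOpenAIVoice());
```

Assigning a single provider to the agent's voice property corresponds to using one of the providers directly; the composite is for mixing and matching.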

21:14

So this audio file, actually I should have played this before. "Grocery bags, especially the reusable ones." That was just a test to see if everyone could hear the audio as well. So it was a debate about grocery bags and you

21:28

know, what they provide. It was a debate created by two agents, actually; we'll look into this demo later. But what we did now is just see that we have our transcription here. So the next part is how do we use this with an agent? And so here in our second example, what we're going to do is take the audio from my mic and then we're

21:50

going to pass this into our agent's listen method to transcribe it. Then we're going to pass this into our generate function and see what the agent responds with. And we're going to log that out.

22:10

"Why is the sky blue?" So apparently the sky is blue because the molecules in the Earth's atmosphere scatter sunlight, and you can read the rest for why the sky is blue. So this is just a very short demonstration of how speech-to-text works with Mastra. The next lesson we'll go into is text-to-speech.

22:35

So here it's very similar, except now we are using the speak method and we're passing a string of what we want the speech synthesis to say. We're creating an audio stream and we're going to play it to our speaker here. "Hello, how are you doing today?" All right. So now, when it comes to

23:08

the agent itself, in the second example we have an agent who has the capability of looking things up online. He's going to search for the latest news about the Lakers and just talk about the top three headlines there. And then once we get that text, we're going to pass it into our speak

23:32

method, where we specify a response format, the audio file type, which is wav here. And we're also going to just bump up the speed for this demo. One, Luka Dončić's 39 points lead Lakers to clinch third seed in Western Conference. Two, Trail Blazers dominate

24:11

Lakers 109–81 in regular season finale. Three, Mavericks GM defends Luka Dončić trade to Lakers. Expresses no regrets.
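The speak() call from this demo, pass the text, pick an output format, bump the speed, looks roughly like the stub below. The option names (responseFormat, speed) follow what the demo showed on screen, but exact option names vary by provider, so treat them as illustrative rather than the real signature:

```typescript
// Stand-in for a provider's speak() method: "synthesizes" audio for a piece
// of text with a target container format and playback speed. A real provider
// returns an audio stream you would pipe to the speakers.

interface SpeakOptions {
  responseFormat?: "wav" | "mp3"; // container for the synthesized audio
  speed?: number;                 // 1.0 = normal playback speed
}

async function speak(text: string, opts: SpeakOptions = {}): Promise<string> {
  const format = opts.responseFormat ?? "wav";
  const speed = opts.speed ?? 1.0;
  // Placeholder "audio": a tagged string instead of real PCM/encoded bytes.
  return `[${format} @ ${speed}x] ${text}`;
}
```

In the demo the headlines were spoken slightly faster than normal, i.e. something like `speak(headlines, { responseFormat: "wav", speed: 1.2 })`.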

24:19

All right. Well, we heard it. The Mavericks' GM has no regrets trading Luka Dončić.

24:26

So, what Shane mentioned before about the different providers: we have one for speech-to-text, we have one to generate a response, and we have one for transforming text to speech, and these two examples were pretty much showcasing that. Now, before I hop into

24:49

our voice-to-voice example using a real-time model, we do have one more example of text-to-speech, and this one simulates two agents having a debate about a topic that I will provide. So if we scroll down, we fetch our two agents from the Mastra instance: we have our optimistic agent and our skeptic agent, and

25:19

then we pass these into this method called processTurn, and what we're saying here is: based on the previous agent's response, come up with your own response to their talking points. And before I go on, here I want to introduce a new package,

25:43

the @mastra/node-audio package. We created this package because when we were creating these demos for this workshop, we found it very cumbersome just to get started if you want to play audio, record your mic, or even record the audio to a file. So we created this Mastra node

26:05

audio package to simplify that process. So, I highly recommend checking it out. We're going to continue iterating on it as we create more workshops and more of these demos. So, it's a fun package

26:17

just to get started with. So, for this topic, we're going to talk about Seattle because I'm living here right now. Seattle is a vibrant city brimming with innovation and natural beauty. Known as the Emerald City, it offers a unique

26:45

blend of urban excitement and lush green landscapes, making it a haven for both tech enthusiasts and nature lovers. With iconic landmarks like the Space Needle and a thriving art scene, Seattle is a hub of creativity and opportunity. Sure, Seattle's got its attractions, but let's not ignore the notorious traffic and skyrocketing cost of living that

27:11

plague the city. It's hard to enjoy all that vibrant innovation when you're sitting in congestion or struggling to pay exorbitant rents. And as for being a hub of creativity, doesn't every major city claim that these days?

27:26

Well, I'd say they both have pretty valid points. It is pretty green and vibrant here, but the cost of living is quite high. So now we're going to dig into voice-to-voice, how to implement voice-to-voice agents using real-time models. So one of the main differences between this

27:53

example and the past two examples is that these real-time models require different methods and different handling. We also use an event-driven architecture, as well as a different kind of connection or protocol to send information. In this case we're using websockets. This is all to reduce latency for

28:21

our responses here. So I'm going to give a high-level overview of what the methods are when it comes to implementing these voice-to-voice capabilities on these agents. First we have getMicrophone, sorry, that's coming from the node audio package. First we start with connect. This sets up the

28:48

websocket connection, and we also have close, which disconnects the websocket connection. And with these real-time voice providers you can give them tools. So one way to add tools to these models is through the method addTools. Now, if you create your agent with tools already attached, like

29:10

I'll show here in an example, those tools are already passed to that model, so you wouldn't really need to use that method at all. We also have

29:22

a way to update the instructions, or the system prompt. Then there's the agent's voice speak method. Before, this would return a readable stream, but now it will emit a speaker event. The speaker event is when the agent's responding.

29:38

We also have a method on the voice called send, and this takes a stream. This is where we send my response, or audio, to the agent. And then we also have a listen method, which will emit a writing event. So this is for transcribing audio. So in our writing

29:58

event that's emitted, we have two roles, the assistant and the user, and we can also log out what was being said. Here we have a way to update the config. The config in this case is the session config. You could think of a session as the entire

30:18

conversation between myself and the voice agent. So in this example here, we are going to change the configuration on our voice activity detection, and this will emit a session updated event, which you can add handlers on. And if you're not using voice activity detection, as we stream my response to the

30:45

agent, we're going to need a way to trigger a response from the agent itself, and that's where the answer method comes in. This will then emit a response created event, followed by a response done event. We also have an event to listen for when the agent invokes a tool. This is the tool call start event, and then we also have

31:10

one for after the tool has been called and the execution has finished; this will emit a tool call result event. And then finally we have error, for error handling. So now we can go into our example here. This example is a voice agent where we are going to record my audio and upload it to

31:39

Cloudinary and then send that to Roark. So here we initialize our Roark client with their SDK. The createConversation method is the method that has all of the logic for these interactions. But before I dive into that, I wanted to

32:04

take a look specifically at this method, onConversationEnds, which will be called when the conversation ends. This is where we take our audio file, upload it to Cloudinary, and get our URL. And then we're going to pass this into the Roark SDK.
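The payload handed over when the conversation ends has roughly this shape: a recording URL, when the call started and in which direction, who spoke first, plus any tool invocations and free-form filter properties. The field names below are illustrative (Roark's actual SDK may name and nest them differently), so check the integration docs before copying:

```typescript
// Illustrative shape of the data posted to a call-analytics service like
// Roark when a conversation ends. Per the workshop, the essential input is
// the audio recording URL; transcription, sentiment, and evaluation happen
// on the service side.

interface CallParticipant {
  role: "agent" | "customer";
  spokeFirst?: boolean; // who spoke first is needed to diarize the audio
}

interface ToolInvocation {
  name: string;
  startedAt: string;                   // when the tool was called
  parameters: Record<string, unknown>; // what it was called with
}

interface CallPayload {
  recordingUrl: string;  // e.g. the Cloudinary upload from the demo
  startedAt: string;
  callDirection: "inbound" | "outbound";
  participants: CallParticipant[];
  toolInvocations: ToolInvocation[];
  properties: Record<string, string>;  // free-form filters (agent id, customer id, ...)
}

const payload: CallPayload = {
  recordingUrl: "https://example.com/recordings/call.wav", // hypothetical URL
  startedAt: new Date("2025-04-17T16:30:00Z").toISOString(),
  callDirection: "inbound", // the demo simulated an inbound phone call
  participants: [
    { role: "agent", spokeFirst: true },
    { role: "customer" },
  ],
  toolInvocations: [
    { name: "getWeather", startedAt: new Date().toISOString(), parameters: { city: "Seattle" } },
  ],
  properties: { demo: "voice-workshop" },
};
```

The properties map is what later lets you filter the calls list and reports down to a particular agent or customer.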

32:23

Here we also pass in when the conversation took place. We're simulating an inbound phone call. And then we also pass in the participants. Here we have the agent, and we're specifying that the agent spoke first, and then we also have myself as

32:43

the customer. We can also pass in tool invocations. So we're going to pass in a list of tools, when they were called, and what they were called with. And then we're just

32:55

going to log this response. So when we go into createConversation, I was just going to highlight the main thing since we're running low on time here. We first get our agent from the Mastra instance. Then we're going to listen for the speaker event. So remember,

33:14

that's when the agent speaks, or when the agent sends a response, and we're going to play that audio stream. Next, we're going to listen for when the tool call starts. This will allow us to give Roark the information of when a tool has been called. And then once it has

33:38

been called and it's been resolved, we're going to take that tool information, take when it was started, and pass it into an array that we will send to Roark. And huddle, this is another method that is being imported from our node

33:57

audio package. And this handles the microphone, the speaker, as well as the recording for us. And finally, when we want to start the conversation, what happens is we first start our websocket connection, then we start our huddle, which is again the audio and the recording.
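Put together, the event-driven flow just walked through, connect the websocket, wire up speaker and writing handlers, stream mic audio with send, and trigger a reply with answer, looks roughly like this. The event and method names follow the workshop, but the class is a local stub that emits the events synchronously, not the real realtime provider:

```typescript
import { EventEmitter } from "node:events";

// Local stub of a realtime (speech-to-speech) voice provider. A real
// provider holds a websocket open; here connect/send/answer just emit the
// same events so the handler wiring can be shown offline.

class StubRealtimeVoice extends EventEmitter {
  private connected = false;

  async connect() { this.connected = true; }  // opens the websocket in real code
  async close() { this.connected = false; }   // tears it down

  // Stream microphone audio to the model; transcription comes back
  // as "writing" events with a role attached.
  async send(audio: string) {
    if (!this.connected) throw new Error("not connected");
    this.emit("writing", { role: "user", text: audio });
  }

  // Trigger a response; the agent's audio arrives via "speaker" events.
  async answer() {
    if (!this.connected) throw new Error("not connected");
    this.emit("speaker", { audio: "agent-audio-chunk" });
  }
}

const voice = new StubRealtimeVoice();
const spoken: string[] = [];
const transcript: string[] = [];

// Handlers: play agent audio to the speakers, log the transcription.
voice.on("speaker", (evt: { audio: string }) => spoken.push(evt.audio));
voice.on("writing", (evt: { role: string; text: string }) => transcript.push(`${evt.role}: ${evt.text}`));

// One round trip: connect, stream a user utterance, ask for a reply, hang up.
async function demo() {
  await voice.connect();
  await voice.send("what's the weather in Seattle");
  await voice.answer();
  await voice.close();
}
```

Calling demo() produces one writing event for the user's utterance and one speaker event for the agent's reply, which mirrors the connect, huddle, send sequence described above.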

34:20

And then we're going to send my audio data to the send method. This is for me to send a response to the agent, and if there's an initial message, the voice agent is going to emit a speaker event. And yeah, I think that's about it. So we're going to get started and

34:45

see how this demo goes. Howdy, partner. Hey, what's the weather like in Seattle?

35:03

It's clear in Seattle right now, with a temperature of about 10.7°C, feeling like 8.3°C, and a moderate breeze. Okay, cool. Glad you think so.

35:18

So once I stop that conversation, we're going to go back here and take a look at these logs. So here we can see the logged "sent to Roark" URL. This is our audio file here, or the URL to our audio file. And then this is the response that we're getting from Roark. We can see that

35:38

the status is pending. And we are sending our information about myself as well as the agent, when the tool was called, as well as what kind of call this was. So just to wrap this up, this is a way you can create voice agents with Mastra. So I'm going to hand this off to Roark, who can now explain how we

36:04

can use this conversation audio, as well as the other additional data, to monitor and evaluate our agents. Thank you, Yujohn. So we can get started by actually going over our integration docs, just to give you a little bit more context on how we would integrate a voice agent with

36:30

Roark. So let me share my screen. Yeah. So as you can see here, we basically have an SDK for both

36:47

Node and for Python. Obviously here with Mastra we're using the Node.js SDK.

36:53

If you're working with another language, you can also hit our API, and that will allow you to send the call over to Roark. When it comes to the data that we need, as you already saw, it's very simple. From our end, we mainly need an audio recording. From

37:09

that, we'll basically do everything else. We'll handle the transcription, we'll handle sentiment detection; every other thing that you would need to do, we take care of. When it comes to the properties that you need to pass us, other than the recording URL that you can see here, you can also pass in any additional properties that you might

37:26

need. These come in very handy if you want to filter some calls. Let's say you have some calls that are bound to a particular agent or to a particular customer, for example. You can use these properties to filter both in the

37:38

calls list, which we'll show you later, as well as in the reports. Besides that, you can also pass in tool invocations. Tool invocations will allow you to evaluate any tool calls that you make across your agent. Tool

37:52

calls tend to be a little bit tricky. It's one area where the LLM tends to hallucinate, especially when it comes to calling the tool at all or calling it with incorrect parameters. So passing that over to us will basically allow you to

38:06

easily evaluate if they're being called correctly or not. Finally, the participants field is one of the required fields. The main thing that we need to know is who spoke first. You basically just need to pass the

38:20

agent and the customer details and tell us who spoke first. Based on that, we'll be able to transcribe and diarize the entire conversation. So what does this look like? Once you actually post the

38:32

call, you'll start seeing it on our dashboard, and there you'll basically see the transcription and all the details that we get from that. This is a very quick example, so I'll hand it over to James, who can show you a better example of what this would look like in production. Thanks, Daniel. Thanks, Yujohn. Okay,

38:53

cool. So, let me go ahead and share my screen here. Cool. So, for this demo, what we're going to highlight

39:04

here is: let's say that we're building a voice agent for a dental clinic. And the goal for us is to be able to schedule appointments for customers, along with answering some of the common requests. So what Daniel and Yujohn just showed you is the way that you would build that agent using Mastra and then how you would send those

39:22

calls over to Roark. So for every single call that comes in, you'll be able to see them over here on the left-hand side. And then once you tap on a single call, we'll automatically handle transcription. We'll handle everything in between, from

39:37

transcription to sentiment detection, highlighting the tool invocations. We'll run evaluation. We are also able

39:44

to capture some very rich emotional and vocal cue information. And so the call that you're seeing in front of you is an actual call between a customer of this dental clinic and the voice agent that's running. So just to help gain some context here, let me go ahead and play back the audio. I just realized I shared without sharing sound,

40:02

so let me just reshare my screen real quick. Cool. Okay. Hello, this is Mary from

40:11

Mary's Dental. How can I assist you today? Hi, I'd like to book an appointment for a root canal treatment, please. Preferably on Tuesday at 3 p.m. if that's available. Um, I'm sorry, but we

40:22

don't offer root canal treatments here at Mary's Dental. However, I can help you with a cavity filling if that's related to your issue. Would you like to book that instead? So, what we heard here is a pretty strange call, right? Essentially, the customer calls in, and

40:34

they want to book a root canal treatment. Every dental clinic on the planet supports root canal treatments, and so does this one. But unfortunately, the agent screws up. It starts to hallucinate. And the reason for this is the way that

40:47

this works: you would define the treatment types either in the prompt or in your knowledge base. And in this case, the developer simply forgot to do so. So the agent just doesn't know what to reply with. So

40:58

it starts to hallucinate or offer alternatives, which in this context is strange. So what we do at Roark is, for every single call that comes in, we run a set of evaluators across those calls. These evaluators can be defined by you in the evaluators tab; you can go in and add some testing criteria. For this

41:17

demo, we just have one running right now, called answer relevance, which essentially tests whether the response the agent gave is relevant to the customer's request. We highlight these failures from the evaluation results in two ways. First, in the transcript. So we'll tell you,

41:34

hey, there was an irrelevant service offered, right? The agent offered a cavity filling, but it didn't directly answer the customer's request. We'll also highlight these failures directly in the waveform. So if

41:45

this were a 20 or 30 minute call, depending on the use case, you wouldn't need to listen to the entire thing. You could quickly scrub to the pieces you care about most. You can also see a summary of the evaluator just by tapping on it, and it will show you the score, the reasoning behind it, and

42:04

the entire breakdown of the relevant segments within the transcript that led to this evaluation result. On top of that, we also capture that rich sentiment information, emotions, and vocal cues. What this means is we can capture whether the user raised their voice or whether they were interrupted. And then we use

42:23

all of that audio information, along with the function calls and the transcript, to evaluate whether your agent did what it set out to do. Cool. So that is the core foundation of our product. On top of this functionality, we've built two other features: one is

42:41

reporting and the other is simulated replays. So let's get into reporting next. Let's say this is now live, you're in production, and you're making hundreds or thousands of calls per week. This screen in particular is great if you know the

42:58

exact call that you want to go through, but that is usually the issue, right? In most cases you want to see a high-level view of how your agent is performing. So what we built here is this concept of dashboards and reporting, where you can go in, create your own reports, and attach them to dashboards. This is some sample data that we put together here

43:18

for this use case. But essentially you can see things like the total number of successful calls and what the average emotional sentiment was against this time period. You can see the top reasons for call failures. And then one of my personal favorites is

43:36

this call topic flow, which shows you a high-level view of the most common paths that your agent has taken. But let me show you how easy it is to create one of these reports yourself. So let's jump into the reports tab and hit new report here. What

43:48

you're going to do is, let's say, select the past seven days, and we're interested in seeing all of the calls where the evaluation has failed, right? And so now we see we have seven. Okay, let's go over the past 30 days. Great. This is

44:03

interesting, but it's still not super useful on its own. What you can do from here is hit the breakdown button and break this down by evaluator name. Now, if you have multiple evaluators, you'll actually be able to see which evaluator has failed. And then you can also see the matching

44:20

calls for that specific evaluator. But what's cool here is you can keep adding more and more metrics. So let's say we add all the emotions that were detected, and we see we have 480 emotions. Now let's break that down by the emotion, and we can see that we have 300

44:38

instances of sadness detected over the past 30 days. You can even tap into these and see the exact relevant calls. And then you can either group by the total number of events or by the total number of unique calls. Cool. So

44:55

yeah, so that's reports. The final thing I'd love to show you is the concept of a simulated test set. So I'm just going to jump back into the failing call that we saw earlier and go here. Cool. So let's

45:14

just... Okay, great. So let's say we want to fix this issue, right? What a developer would do today is go into their prompt and update it, or add to the knowledge base. You

45:25

know, they'd add a new treatment type and then literally grab their phone, call their agent, and try to mimic the customer in this conversation. So what we built is this idea where you can simulate a replay of an existing call: with a tap of a button, you select the evaluator you want to test

45:42

for, the agent you want to test against, and then you just hit start test run. What's going to happen here is Roark is going to call your agent. It's literally going to go through the phone lines and call your agent. It's going to act as

45:55

the customer and have a full end-to-end conversation, mimicking the customer we saw in that specific call. The cool thing about this is not just that you can simulate one call, but that you can simulate hundreds or thousands of calls in parallel at the same time. And so this is great to add

46:15

to your CI/CD pipeline, so that anytime you're going to update your prompt, you can have a test case that you run before you deploy that prompt to production, right? Or when you add a new tool call or change something in your agent. This will take the same duration as the actual call, so let's just jump to the

46:31

results here. Once the test cases have completed, you'll get a very simple pass or fail. You can even tap into it, and you'll be able to see how the simulated conversation went. So

46:43

in this case you'll see that Roark acted as the customer. It also requested a root canal treatment. But now the agent no longer got confused; it went ahead and asked for the full name. This is assuming we went into our

46:56

prompt, modified it, and tested it again. And so, yeah, we essentially just went from a failing test to a passing one. And that's simulations.
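The CI/CD gating James describes can be sketched as a small check over replay results. This is a TypeScript sketch under stated assumptions: the `ReplayResult` shape is a hypothetical stand-in for whatever the test-run API returns, not Roark's documented schema.

```typescript
// Sketch: gate a deploy on simulated-replay results, as you might in a
// CI/CD step. The ReplayResult shape is a hypothetical stand-in for
// whatever the test-run API returns, not Roark's documented schema.
interface ReplayResult {
  callId: string;
  evaluator: string; // e.g. "answer_relevance"
  passed: boolean;
  reasoning: string;
}

function gateDeployment(results: ReplayResult[]): {
  ok: boolean;
  failures: ReplayResult[];
} {
  const failures = results.filter((r) => !r.passed);
  return { ok: failures.length === 0, failures };
}

// Example: one replay passed, one failed the answer-relevance evaluator.
const results: ReplayResult[] = [
  {
    callId: "call-1",
    evaluator: "answer_relevance",
    passed: true,
    reasoning: "Booked the requested treatment.",
  },
  {
    callId: "call-2",
    evaluator: "answer_relevance",
    passed: false,
    reasoning: "Offered an unrelated service instead.",
  },
];

const gate = gateDeployment(results);
console.log(gate.ok); // false
console.log(gate.failures.length); // 1
```

In a pipeline step you would fetch the results once the simulated run completes and exit non-zero on any failure (for example, `process.exit(gate.ok ? 0 : 1)`), blocking the deploy.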

47:09

Cool. With that, I'll pass it back over to Shane. All right. A few slides here just to wrap up, and then we will have time for questions. We've been answering

47:22

a lot of them as we're going. So, if you do have questions, get them queued up, and I will just share my screen here once I find what screen to share. All right, there it is.

47:55

All right. So, just to wrap up, some quick links if you are interested in learning more about Mastra, you have those listed there. And some links for Roark: you can find their GitHub repo and their website, and you can connect with them as well.

48:16

And one thing, if you want to do a little bit of audience participation: if you haven't already given Mastra a star on GitHub, please go do so. And you can join the Mastra Cloud beta if you're interested in being able to deploy Mastra agents or Mastra workflows to our cloud service. And I guess, James, you want to talk through this? Yeah. So, for everyone that

48:41

joined this workshop, we're giving three months of unlimited usage for anyone that signs up to Roark (on our startup plan, to be clear). You can simply go to roark.ai/mastra-workshop, fill in the details, and we'll reach out to you. Yeah. Cool. So, for anyone building

49:01

or wanting to build or monitor voice agents, that's a cool promotion. And if you want to get started with Mastra, just run npm create mastra and you can get started. But let's go ahead and open everything up for questions. So the last one on the list: is there a Mastra ambassador program? I can talk

49:23

through this one. Not officially, but reach out to me and we can talk about it. We're definitely happy to encourage people to go talk at various events and meetups, things like that. So definitely reach out. Let's see. Do you have an idea

49:40

of when the Mastra client SDK will be able to stream to Mastra? Are you talking, Brandon, in terms of voice, or just streaming to the front end with text? Voice. All right. Okay. Yeah. Go ahead, Obby. Yeah. So, we actually have a PR right

50:08

now that takes the... Okay, let me take a step back. With OpenAI Realtime, as well as any real-time model, you can connect via WebSockets or WebRTC; that's what the connection method does. When you're doing WebRTC, you want to do it from a client, maybe a browser. So, we have

50:27

a PR right now that we've been working on that allows the Mastra client SDK to create a WebRTC connection and essentially talk to the real-time model. It's really fast and really nice. So that is coming. We couldn't get it working in time for the workshop. So,

50:45

that's why it's not here. And I just want to make another point about voice in general. Thanks, everyone, for coming to this workshop. As you can see, Mastra is not a

50:56

voice framework. We are an agent framework. But voice is such an important part of our community and the AI industry, or at least it's going to be. So our mentality is we want to bring agents that can do voice,

51:10

right? Not a voice framework. It's a different way of thinking. Everything we showed today is

51:16

like our steps of building these Lego blocks, right? They'll eventually come together into something that you could put in front of a phone or a call center or anything. This is our first voice workshop. The next one will be even more developed, right? Imagine what we showed today was like

51:34

Legos and the next one will be like the Millennium Falcon or something, right? Um so just wanted to mention that. Yeah.

51:42

And to continue on that, Max has good questions: integration with WebRTC, integration with something like Twilio, voice activity detection. Yes, exactly right, Max. Obby

51:53

kind of already answered that, but those are things that are coming. They're not necessarily fully baked today. You could, of course, roll a lot of this stuff yourself, right? You could build on the Legos that we have, but we want to make it even easier. So, there'll be

52:08

more to come in the near future. All right, let's see. Did we miss any other questions? We had some good ones that we answered along the

52:27

way. Pipecat is a bit hard for now. Yeah, I guess, James or Daniel, I know some of your customers are using Pipecat, right?

52:41

for Python. I think anyone building in Python is defaulting to pipcat today. Um and so you know we really want to see mustra become the um pipat for typescript on the in the voice ai land. Yeah.

52:57

Yeah. We've looked at Pipecat quite a bit too, just as we're evaluating how we best build the right voice tools for our agents. But yes, I think voice is still early enough that once you get outside of some of the big

53:15

platforms, it gets more complicated. There are a lot of moving parts with voice, right? You have streaming both ways, you have a lot of different components, and each one of those adds latency, and it's a very latency-sensitive thing; we're having a conversation. So, it is very early. All right. Do we have any other questions? Otherwise, we can get out of here and give 10 minutes back to everybody.

53:38

But I'll hold for another 60 seconds in case anyone else has questions. And just to reiterate, the slides and the repo that we showed today will be available. They'll come out in an email. It usually takes me a few hours to get it out, but it'll

54:00

come out sometime today. You will have access to the video recording, the slides, and the actual code itself, and we'll send you a link to any of the stuff that we shared on here as well. All right, sorry, it's Frank. Just wanted to ask you a real quick one. It's a little bit of

54:22

a follow-up: would it be possible to create a Fireflies-like meeting notetaker? I have been talking to the folks from Arcade on the authentication and authorization side. It's just not super clear how that kind of thing would happen with an agent that's watching my calendar or something like that. Just briefly,

54:46

obviously we only have a couple minutes, but how do agents monitor and watch processes or things that are coming in? Yeah, there's a lot that I think we can unpack there. A couple things, right? A lot of this stuff, depending on

55:05

how it's built, could be webhook-driven, meaning that an agent gets triggered at the end of a call. Let's say you just wanted to build a call recorder for Zoom. I'm pretty sure Zoom has the ability to subscribe to when a call recording is available, so you could pass that recording URL to an

55:23

agent that would then process it. So, that's one option. You'll see even in this meeting, I think Obby has his notetaker. Fireflies, for instance, actually do it a little bit differently. They're essentially joining

55:35

the call and then recording it as the call is going on. So they're getting a separate recording from what you would get with a Zoom cloud recording. That becomes a little bit more complicated to actually pull off. There are some different tool providers that are starting to provide

55:54

better authentication systems, for MCP for instance, where you could then connect to different services, and users of your app could pass their OAuth credentials in and you could use those. But it is still a very early space. So I think the only one I've really seen so far that kind of

56:13

works is Composio. They have some MCP authentication that would allow you to do that. But most use of MCP today, which is the easiest way to get connected to tools, is really more for personal use cases. You could probably really easily build a

56:29

personal voice recorder... well, not easily, but you could more easily build a personal voice recorder. But once you wanted to build an app that allowed your users to connect their voice recorder, that's where the authentication is still very early and not really figured out yet. People have obviously figured it out, but it's very much the case that you have to

56:46

implement a lot of the features yourself. I do anticipate over the next couple of months it's going to get a lot easier, though. Awesome. Thanks for the info. I'm trying to

56:57

sign up for the cloud-hosted solution and I just had some follow-up questions. Is there anyone I can reach out to, or can I schedule something? The Mastra Cloud? Yeah. Reach out to

57:11

me and I'll get you set up. Sounds good.

57:16

Thank you. All right. Any other last minute questions?
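One aside while we wait: the webhook-driven pattern Shane described for Frank's notetaker question can be sketched in a few lines of TypeScript. The event payload here is a hypothetical, Zoom-style stand-in rather than Zoom's real webhook schema, and the agent runner is a placeholder for whatever actually processes the recording.

```typescript
// Sketch: trigger an agent from a "recording completed" webhook.
// The event shape below is a hypothetical, Zoom-style stand-in,
// not Zoom's real webhook schema.
interface RecordingEvent {
  event: string; // e.g. "recording.completed"
  payload: { meetingId: string; recordingUrl: string };
}

// Placeholder for whatever processes the audio, e.g. a Mastra agent
// that transcribes and summarizes the recording.
type AgentRunner = (recordingUrl: string) => string;

function handleWebhook(event: RecordingEvent, runAgent: AgentRunner): string | null {
  // Only fire the agent when a recording is actually ready.
  if (event.event !== "recording.completed") return null;
  return runAgent(event.payload.recordingUrl);
}

// Example with a stub agent that just acknowledges the URL.
const result = handleWebhook(
  {
    event: "recording.completed",
    payload: { meetingId: "m1", recordingUrl: "https://example.com/rec.mp4" },
  },
  (url) => `processed ${url}`,
);
console.log(result); // "processed https://example.com/rec.mp4"
```

The key design point from the discussion: the agent stays passive and is invoked by the platform's event, which sidesteps the harder problem of an agent actively joining and recording the call itself.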

57:27

Could I just ask you for the recording of the last workshop, if you have it, just to have a reference? Thank you. Yep. Give me one second. I'll drop it in the channel for anyone who wants access to it. We

57:41

don't... we should, we could just make them public. We send it out to anyone that signs up, but I will just get you the link here. It's just unlisted.

57:54

There you go. Awesome. Thank you very much. All right, everybody. Well, thanks

58:02

James, thanks Daniel, thanks John. Thanks everyone for attending, and hopefully we'll see you at some future events and continue learning more about AI agents, workflows, voice, all those kinds of things. Catch you later. Yeah, thanks for having us. Awesome. Have a good one.
