WEBVTT

00:00.000 --> 00:11.000
And Daniela, it has been five years that I've been trying to invite you, and every time,

00:11.000 --> 00:15.000
there is some reason; part of it is to blame on your British government.

00:15.000 --> 00:18.000
Part of it to yourself, part of it to me, I don't know.

00:18.000 --> 00:19.000
Now let's blame the British.

00:19.000 --> 00:23.000
I've been trying to get you on my stage for five years now, and it's finally happening.

00:23.000 --> 00:24.000
Yay.

00:24.000 --> 00:26.000
So glad to hear.

00:26.000 --> 00:27.000
Thank you.

00:27.000 --> 00:36.000
And this will be our actual first AI talk of the day, and probably our only one.

00:36.000 --> 00:37.000
Oh, wow.

00:37.000 --> 00:39.000
That's rare.

00:39.000 --> 00:41.000
Which is remarkable.

00:41.000 --> 00:45.000
Daniela, you now work at Google to make us use the amazing AI Gemini,

00:45.000 --> 00:50.000
and you're talking about how to use MCP when you are writing Go,

00:50.000 --> 00:53.000
well, sorry, when your AI is writing Go for you.

00:54.000 --> 00:55.000
Excuse me.

01:01.000 --> 01:02.000
Okay, everyone.

01:02.000 --> 01:04.000
So, I'm Daniela Petruzalek.

01:04.000 --> 01:06.000
I'm a Developer Relations Engineer.

01:06.000 --> 01:11.000
It's kind of refreshing to see a conference where you don't have so many AI talks,

01:11.000 --> 01:14.000
but as you can see, I'm about to give one.

01:14.000 --> 01:20.000
So, before I get started, who here knows what an MCP is?

01:20.000 --> 01:22.000
Okay, this will be so easy.

01:23.000 --> 01:24.000
Amazing.

01:24.000 --> 01:29.000
So, in this talk, first I'll do a brief introduction,

01:29.000 --> 01:34.000
just to get things started and talk a little bit about motivation.

01:34.000 --> 01:37.000
Because most of you are aware of it, I won't spend too much time explaining

01:37.000 --> 01:38.000
what MCP is.

01:38.000 --> 01:40.000
That's not the important part.

01:40.000 --> 01:46.000
And then I want to get to the main feature, which is talking about this experimental MCP server,

01:46.000 --> 01:48.000
which I call go doctor.

01:49.000 --> 01:52.000
Discuss a little bit of results and best practices,

01:52.000 --> 01:54.000
I learned along the way.

01:54.000 --> 01:57.000
So, a little bit about myself: I'm based in the UK,

01:57.000 --> 02:00.000
but originally I'm from Brazil.

02:00.000 --> 02:03.000
So, if you find my accent a bit weird, that's the reason.

02:03.000 --> 02:07.000
That's also the reason why my brain is like this at this time of the year,

02:07.000 --> 02:09.000
because winter.

02:09.000 --> 02:15.000
So, anyway: previously, before being in DevRel, I was a back-end and data engineer,

02:15.000 --> 02:20.000
currently obsessed with AI, which might be an unpopular opinion in this room,

02:20.000 --> 02:23.000
given the composition of this audience.

02:23.000 --> 02:27.000
But my reason is, I've been 20 years in the industry,

02:27.000 --> 02:31.000
and I cannot stand writing yet another CRUD API.

02:31.000 --> 02:33.000
It's as simple as that.

02:33.000 --> 02:38.000
I find it super refreshing that I have different things to do now.

02:38.000 --> 02:39.000
That's the reason.

02:39.000 --> 02:42.000
So, if you want to talk about other stuff during coffee,

02:42.000 --> 02:47.000
indie games and my cats are always my favorite subjects.

02:47.000 --> 02:51.000
Okay, in terms of introduction, it's very brief.

02:51.000 --> 02:54.000
I just want to invite you to think about this for a moment.

02:54.000 --> 02:59.000
We are entering the era of non-deterministic software engineering.

02:59.000 --> 03:02.000
I've been thinking this for a while,

03:02.000 --> 03:04.000
because, why is it non-deterministic?

03:04.000 --> 03:07.000
We are asking agents to write the code for us.

03:07.000 --> 03:13.000
And you might not get what you want.

03:13.000 --> 03:16.000
So, we used to have tools that say,

03:16.000 --> 03:19.000
I will write this code, I'll compile, and yeah,

03:19.000 --> 03:21.000
either it works or it doesn't work, but if you compile again,

03:21.000 --> 03:25.000
either it works or it doesn't work again; the same result. It's deterministic.

03:25.000 --> 03:28.000
But if you give the same prompt to an agent,

03:28.000 --> 03:32.000
you do not have any guarantees that this code will be exactly the same.

03:32.000 --> 03:35.000
This is what I mean by non-deterministic.

03:35.000 --> 03:39.000
And there's a podcast from Martin Fowler,

03:39.000 --> 03:41.000
and the Pragmatic Programmer; they talk a lot about this.

03:41.000 --> 03:44.000
I recommend you listen to it.

03:44.000 --> 03:48.000
But I added an asterisk. There's a reason.

03:48.000 --> 03:53.000
Are we really only now getting into non-deterministic software engineering?

03:53.000 --> 03:58.000
Or should I remind you that human beings are non-deterministic?

03:58.000 --> 04:02.000
And we've been writing software for a hundred years now.

04:02.000 --> 04:04.000
Yeah, probably not that long.

04:04.000 --> 04:08.000
But the point is, we as human beings,

04:08.000 --> 04:11.000
we have our teams, and sometimes they say, yeah, do this.

04:11.000 --> 04:14.000
And then some people go: go build, go test,

04:14.000 --> 04:16.000
go lint, deploy, and everything works well.

04:16.000 --> 04:19.000
And then the next time you do the same thing: go build,

04:19.000 --> 04:23.000
forgot to run the tests, deploy, and then everything's broken.

04:23.000 --> 04:25.000
So this is the non-deterministic experience.

04:25.000 --> 04:29.000
So we try to give tools to humans, like CI/CD,

04:29.000 --> 04:31.000
to have deterministic outcomes.

04:31.000 --> 04:35.000
When thinking about agents, that's the same thing we need to do.

04:35.000 --> 04:37.000
Agents are non-deterministic.

04:37.000 --> 04:40.000
How can we improve the response quality?

04:40.000 --> 04:42.000
We need to give them deterministic tools.

04:42.000 --> 04:45.000
That's the message here.

04:45.000 --> 04:48.000
Another asterisk: yeah, there are a lot of exceptions,

04:48.000 --> 04:51.000
because, I don't want... I'm very non-committal about the things I say.

04:51.000 --> 04:54.000
But the thing is, yeah, not necessarily all the tools

04:54.000 --> 04:56.000
we're going to give our agent will be deterministic.

04:56.000 --> 05:00.000
But the idea is to reduce the degree of freedom of the agents.

05:00.000 --> 05:03.000
So it doesn't derail that much.

05:03.000 --> 05:07.000
If you know you're going to do go build, go test, go lint, deploy,

05:07.000 --> 05:10.000
Do not give the agent the chance to think.

05:10.000 --> 05:15.000
Give it a tool that does build, test, lint, and deploy.

05:15.000 --> 05:17.000
So that's the baseline.

05:17.000 --> 05:20.000
So if you take just this from this talk, this is everything

05:20.000 --> 05:21.000
you need to know.

05:21.000 --> 05:22.000
But let's see a little bit more.

05:22.000 --> 05:24.000
So, Model Context Protocol.

05:24.000 --> 05:26.000
I will not spend too much time on this

05:26.000 --> 05:28.000
because most of the people are aware.

05:28.000 --> 05:32.000
It's a protocol that standardizes how to provide context,

05:32.000 --> 05:35.000
tools, and resources to LLMs.

05:35.000 --> 05:38.000
It was initially built by Anthropic.

05:38.000 --> 05:46.000
Now it's maintained by a broader organization.

05:46.000 --> 05:50.000
Building blocks: tools, prompts, resources.

05:50.000 --> 05:53.000
Tools are the things we give to the models

05:53.000 --> 05:55.000
so they can interact with the world.

05:55.000 --> 05:56.000
Prompts.

05:56.000 --> 05:59.000
It's a matter of storing a prompt template

05:59.000 --> 06:00.000
server-side.

06:00.000 --> 06:01.000
So you don't need to remember it.

06:01.000 --> 06:02.000
You can.

06:02.000 --> 06:04.000
And depending on the client, they will access in different ways.

06:04.000 --> 06:06.000
I use Gemini CLI a lot.

06:06.000 --> 06:07.000
If you use Claude,

06:07.000 --> 06:09.000
they usually map to slash commands.

06:09.000 --> 06:12.000
For example, resources are for reading data.

06:12.000 --> 06:14.000
So they are read-only.

06:14.000 --> 06:18.000
I'm going to focus only on tools for this talk.

06:18.000 --> 06:20.000
And since this is a Go conference and Go is what we're

06:20.000 --> 06:22.000
doing, the most important thing you need to know:

06:22.000 --> 06:25.000
There's an official Go SDK for MCP.

06:25.000 --> 06:28.000
Honestly, I was thinking about what I should put on this slide.

06:28.000 --> 06:30.000
It's always been good enough.

06:30.000 --> 06:32.000
This is amazing.

06:32.000 --> 06:34.000
It just works.

06:34.000 --> 06:35.000
There's nothing to say.

06:35.000 --> 06:37.000
It works pretty well.

06:37.000 --> 06:39.000
So go doctor.

06:39.000 --> 06:41.000
What is go doctor?

06:41.000 --> 06:43.000
So we need to go back a little in the past.

06:43.000 --> 06:47.000
So not to the early 2000s, but just six months ago.

06:47.000 --> 06:50.000
I was starting to learn about MCP.

06:50.000 --> 06:52.000
And this was the situation

06:52.000 --> 06:54.000
I was talking about.

06:54.000 --> 06:56.000
I just joined Google at that time.

06:56.000 --> 06:58.000
I was playing with Gemini 2.5.

06:58.000 --> 07:03.000
And I was actually trying to use Gemini CLI to create

07:03.000 --> 07:05.000
an MCP server.

07:05.000 --> 07:08.000
And it was so frustrating.

07:08.000 --> 07:10.000
Because so many things happened.

07:10.000 --> 07:11.000
So yeah, use this.

07:11.000 --> 07:14.000
You give it a rule and it doesn't follow your rule.

07:14.000 --> 07:16.000
I want you to use this package.

07:16.000 --> 07:17.000
It doesn't read it.

07:17.000 --> 07:19.000
It goes and does something else.

07:19.000 --> 07:20.000
Oh.

07:20.000 --> 07:23.000
And I want it to use this package, and it says,

07:23.000 --> 07:24.000
Oh, yeah.

07:24.000 --> 07:25.000
Okay.

07:25.000 --> 07:27.000
I'm going to start hallucinating all the APIs.

07:27.000 --> 07:29.000
And I said, what the hell?

07:29.000 --> 07:30.000
What's going on?

07:30.000 --> 07:32.000
So I started designing this.

07:32.000 --> 07:35.000
I said, okay, we need to fix that.

07:35.000 --> 07:39.000
And so the goal is, instead of having generic tools like

07:39.000 --> 07:42.000
write file, edit file,

07:42.000 --> 07:46.000
I don't know.

07:46.000 --> 07:49.000
or run shell command,

07:49.000 --> 07:51.000
I wanted to give it specialist tools.

07:51.000 --> 07:54.000
I wanted tools that knew what Go was about.

07:54.000 --> 07:58.000
And I also could add some intelligence to it.

07:58.000 --> 08:01.000
So the goals: reduce hallucinations, improve quality,

08:01.000 --> 08:05.000
token efficiency, response time, all the good stuff.

08:05.000 --> 08:08.000
And this was the first tool I came up with.

08:08.000 --> 08:13.000
Which might sound silly, because you're one shell command away from running

08:13.000 --> 08:14.000
go doc.

08:14.000 --> 08:19.000
I wanted the model to have easy access to the documentation.

08:19.000 --> 08:23.000
So my hypothesis was: if you just read the damn docs,

08:23.000 --> 08:26.000
you would not hallucinate that much.

08:26.000 --> 08:27.000
Right?

08:27.000 --> 08:31.000
And actually this worked pretty well.

08:31.000 --> 08:36.000
So I made a thin wrapper around go doc and gave it to

08:36.000 --> 08:38.000
Gemini 2.5 at that moment.

08:38.000 --> 08:40.000
And it started working well.

08:40.000 --> 08:42.000
So that was very inspiring.

08:42.000 --> 08:47.000
And notice, Gemini 3 is much better at doing this by default.

08:47.000 --> 08:50.000
It doesn't really need this kind of tool.

08:50.000 --> 08:52.000
Because if you try to do something in Gemini 3,

08:52.000 --> 08:54.000
it will automatically think, oh yeah,

08:54.000 --> 08:55.000
I need to read the documentation.

08:55.000 --> 08:57.000
It will run the shell command with go doc automatically.

08:57.000 --> 08:59.000
So it got much better.

08:59.000 --> 09:03.000
But let me show just a very quick demo here.

09:04.000 --> 09:06.000
Just to show what kind of experience it is.

09:06.000 --> 09:08.000
So I have docs here.

09:08.000 --> 09:10.000
Let's call G3.

09:10.000 --> 09:15.000
This is just an alias to Gemini CLI with the Gemini 3 Flash preview.

09:15.000 --> 09:17.000
That's the model I'm using here.

09:17.000 --> 09:21.000
And I can say, like here, read the docs for,

09:21.000 --> 09:24.000
let's see, let's try something.

09:24.000 --> 09:29.000
google.golang.org/

09:29.000 --> 09:33.000
genai, for example.

09:33.000 --> 09:35.000
And this will make the tool call.

09:35.000 --> 09:37.000
So here it is.

09:37.000 --> 09:38.000
Simple as that.

09:38.000 --> 09:40.000
It called the read docs tool on the go doctor

09:40.000 --> 09:42.000
MCP server.

09:42.000 --> 09:46.000
And why is this better than running a shell command?

09:46.000 --> 09:49.000
It kind of is because if I do the same thing,

09:49.000 --> 09:53.000
go doc,

09:53.000 --> 09:56.000
google.golang.org/genai.

09:57.000 --> 10:00.000
It will complain that no module was installed.

10:00.000 --> 10:03.000
And at that point, you need to say, oh, yeah, by the way,

10:03.000 --> 10:06.000
I need to do go mod init docs.

10:06.000 --> 10:11.000
And then I need to say go get... you get the idea, right?

10:11.000 --> 10:13.000
And then I can call go doc.

10:13.000 --> 10:15.000
So then I tried to make a tool

10:15.000 --> 10:20.000
that was easier for the model to use.

10:20.000 --> 10:21.000
So I abstracted all this:

10:21.000 --> 10:23.000
I create a temporary directory.

10:23.000 --> 10:26.000
And initialize a fake module, download the thing,

10:26.000 --> 10:28.000
and then read the docs.

10:28.000 --> 10:31.000
And this is where I'm going with this talk:

10:31.000 --> 10:33.000
the more you can automate those steps and make it

10:33.000 --> 10:36.000
friendly for the model, the better it will be.

10:36.000 --> 10:41.000
So I'm not going to dwell too much on this.

10:41.000 --> 10:43.000
So this was the first tool.

10:43.000 --> 10:46.000
And then I thought, oh yeah, what else can I improve?

10:46.000 --> 10:47.000
You know what?

10:47.000 --> 10:49.000
The code quality is not where it should be.

10:49.000 --> 10:50.000
Not idiomatic.

10:50.000 --> 10:51.000
We are Go developers.

10:51.000 --> 10:54.000
We love idiomatic code, right?

10:54.000 --> 10:58.000
And so I thought: if I just write my code with the model

10:58.000 --> 11:01.000
and then say, review the code now, it's kind of like

11:01.000 --> 11:05.000
you're giving the jail keys to the prisoner.

11:05.000 --> 11:07.000
Like, it just wrote the code, so it will say, oh yeah,

11:07.000 --> 11:09.000
this code is amazing.

11:09.000 --> 11:13.000
So the idea is to give it to a second model with zero context

11:13.000 --> 11:16.000
and just one prompt that says: evaluate this.

11:17.000 --> 11:20.000
And then you have a more unbiased view.

11:20.000 --> 11:23.000
And this was also fairly successful too.

11:23.000 --> 11:25.000
And I say, oh, yeah, okay.

11:25.000 --> 11:27.000
I feel like I'm onto something.

11:27.000 --> 11:28.000
So what else can I do?

11:28.000 --> 11:30.000
What else can we improve?

11:30.000 --> 11:34.000
And I started playing with it more and said, yeah,

11:34.000 --> 11:38.000
the replace tool... like, editing files is very brittle with agents.

11:38.000 --> 11:41.000
They hallucinate things that are not there.

11:41.000 --> 11:43.000
They have white space problems.

11:43.000 --> 11:46.000
They make typos when they write stuff,

11:46.000 --> 11:48.000
like some weird formatting issues,

11:48.000 --> 11:51.000
hallucinated dependencies, sometimes they build,

11:51.000 --> 11:54.000
but they break the test or they fix the test,

11:54.000 --> 11:56.000
break the build or whatever.

11:56.000 --> 11:59.000
And so the act of editing code can fail in so many ways.

11:59.000 --> 12:02.000
The more I used it, the more I've seen those patterns.

12:02.000 --> 12:04.000
So I started creating tools for it.

12:04.000 --> 12:07.000
And I made some early experiments.

12:07.000 --> 12:08.000
And because you know,

12:09.000 --> 12:12.000
the name came from go doc plus tor, and I said,

12:12.000 --> 12:13.000
go doctor! Oh, yeah.

12:13.000 --> 12:16.000
But I also had the figure of the doctor and all.

12:16.000 --> 12:17.000
This is cute.

12:17.000 --> 12:20.000
So let's create more tools that are doctor-themed.

12:20.000 --> 12:22.000
So I created like scalpel, scribble,

12:22.000 --> 12:23.000
because you know,

12:23.000 --> 12:25.000
a doctor's writing is unreadable.

12:25.000 --> 12:27.000
Endoscope.

12:27.000 --> 12:30.000
And this was very cute conceptually,

12:30.000 --> 12:32.000
but it's terrible for the model.

12:32.000 --> 12:33.000
Because it looked at scalpel and went:

12:33.000 --> 12:35.000
I don't know how to use that.

12:35.000 --> 12:36.000
I'll use edit file.

12:37.000 --> 12:43.000
So a quick lesson learned is that you need to sell the tool to the model.

12:43.000 --> 12:47.000
You know, you need to convince it to use the tool.

12:47.000 --> 12:51.000
And scalpel was a very precise tool,

12:51.000 --> 12:52.000
because I thought, you know what?

12:52.000 --> 12:54.000
This replacing doesn't work.

12:54.000 --> 12:56.000
I want to give it lines and columns.

12:56.000 --> 12:57.000
So from this line,

12:57.000 --> 12:59.000
this column, add this, to this line,

12:59.000 --> 13:01.000
this column. It didn't work.

13:01.000 --> 13:04.000
It simply could not do simple column math.

13:05.000 --> 13:08.000
The doctor-themed naming didn't work.

13:08.000 --> 13:10.000
If the behavior is too similar,

13:10.000 --> 13:12.000
I had a simple file edit tool.

13:12.000 --> 13:14.000
It never used it because: oh yeah,

13:14.000 --> 13:15.000
I have replace.

13:15.000 --> 13:16.000
I don't need to use it.

13:16.000 --> 13:17.000
I have my built-in tool.

13:17.000 --> 13:19.000
I'm familiar with my built-in tool.

13:19.000 --> 13:21.000
So I'll not try a different tool for that.

13:21.000 --> 13:25.000
And if your tool doesn't do much more than just running a shell command,

13:25.000 --> 13:28.000
this is very hard for the model to decide.

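For example, the difference between a tool the model ignores and one it actually picks can be as small as the name and description. Both definitions below are hypothetical, invented to illustrate the "selling" point; they are not go doctor's real schemas:

```json
{
  "tools": [
    {
      "name": "scalpel",
      "description": "Edits a file."
    },
    {
      "name": "smart_edit",
      "description": "Preferred way to edit Go files. Unlike a generic replace tool, it tolerates whitespace differences and small typos, runs gofmt automatically, refuses to save code that does not parse, and returns a hint when a match fails."
    }
  ]
}
```

The second description tells the model exactly why it beats the built-in it is already familiar with.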
13:28.000 --> 13:31.000
But this took me months,

13:31.000 --> 13:33.000
because I would create a tool.

13:33.000 --> 13:36.000
run my own MCP server, and start coding, and say,

13:36.000 --> 13:37.000
oh wow.

13:37.000 --> 13:41.000
And all these results were pretty anecdotal.

13:41.000 --> 13:42.000
You know what?

13:42.000 --> 13:44.000
I need to be more professional in this thing.

13:44.000 --> 13:47.000
So, enter the experimentation era.

13:47.000 --> 13:49.000
And this is, in my past,

13:49.000 --> 13:51.000
when I used to be a data engineer,

13:51.000 --> 13:54.000
one of my favorite projects was a recommendation system.

13:54.000 --> 13:57.000
And we did a lot of AB testing.

13:57.000 --> 13:58.000
And the idea of A/B testing is,

13:58.000 --> 14:00.000
you'll have a massive amount of data.

14:00.000 --> 14:02.000
So you can start comparing.

14:02.000 --> 14:05.000
Is alternative A better than alternative B?

14:05.000 --> 14:11.000
Can I really say that if I add go doctor to my agent,

14:11.000 --> 14:15.000
is it better than not having it?

14:15.000 --> 14:18.000
Remember, the models are non-deterministic.

14:18.000 --> 14:19.000
Even if you give it,

14:19.000 --> 14:22.000
and this is... I love mentioning this when I do demonstrations.

14:22.000 --> 14:24.000
Because sometimes, like here,

14:24.000 --> 14:26.000
I'm doing one demonstration.

14:26.000 --> 14:29.000
If it chooses that moment to not do what I wanted it to do,

14:29.000 --> 14:32.000
then it leaves you with a bad impression, right?

14:32.000 --> 14:34.000
So I built this thing, which I call 10K.

14:34.000 --> 14:37.000
It's an experimentation framework; it's open source.

14:37.000 --> 14:41.000
And the idea is, I have building blocks

14:41.000 --> 14:44.000
that I could compose and create my experiments.

14:44.000 --> 14:48.000
And this experiment will run N amount of times.

14:48.000 --> 14:50.000
So I can have a decent sample

14:50.000 --> 14:53.000
and start doing a statistical analysis on top of it.

14:53.000 --> 14:57.000
This is a bit easier to see,

14:57.000 --> 14:59.000
and I'm going to show the UI in a minute.

14:59.000 --> 15:02.000
So the idea: everything starts with a scenario.

15:02.000 --> 15:04.000
For example, create a hello world app.

15:04.000 --> 15:05.000
This is my scenario.

15:05.000 --> 15:09.000
Then I'm going to create, I have a few configuration blocks

15:09.000 --> 15:13.000
that are, for example: this is an agent with Gemini 3.

15:13.000 --> 15:16.000
This is an agent with Gemini 3, plus go doctor.

15:16.000 --> 15:19.000
This is an agent with Gemini 3, plus gopls.

15:19.000 --> 15:20.000
And so on.

15:20.000 --> 15:22.000
So these are the configuration blocks.

15:22.000 --> 15:25.000
And I create the experiment template with my alternatives.

15:26.000 --> 15:29.000
And one alternative is associated with certain configurations.

15:29.000 --> 15:35.000
For example, here, if I look, this is the UI.

15:35.000 --> 15:37.000
Sorry, I'm not a front end developer.

15:37.000 --> 15:41.000
Everything here was vibe-coded, because, I don't know,

15:41.000 --> 15:43.000
I don't know JavaScript.

15:43.000 --> 15:48.000
So here I have a bunch of agents: one is hard-coded

15:48.000 --> 15:51.000
for Gemini 3 Flash, the other hard-coded for Pro.

15:51.000 --> 15:54.000
This one is the original one with nothing.

15:54.000 --> 15:56.000
You know, no model is specified.

15:56.000 --> 15:58.000
I have a bunch of special system prompts.

15:58.000 --> 16:02.000
that I can add or remove; I have some special context files.

16:02.000 --> 16:06.000
Settings, if I want to disable the core tools.

16:06.000 --> 16:08.000
If I want to enable preview features,

16:08.000 --> 16:11.000
only shell enabled, no other tools, and so on.

16:11.000 --> 16:14.000
And based on this, I have a bunch of scenarios as well.

16:14.000 --> 16:16.000
Like for example, the scenario I'm using the most is this

16:16.000 --> 16:19.000
create a recipe server with the official Go SDK.

16:19.000 --> 16:23.000
And this scenario has basically a prompt.

16:23.000 --> 16:27.000
And it has some assets.

16:27.000 --> 16:29.000
This one starts from scratch.

16:29.000 --> 16:31.000
So it's just a prompt.

16:31.000 --> 16:33.000
Some environment variables and I have validation criteria.

16:33.000 --> 16:35.000
I can add here, like: I want it to lint.

16:35.000 --> 16:37.000
I want it to test. I want it to run these commands.

16:37.000 --> 16:41.000
And this becomes part of the acceptance criteria of the prompt.

16:41.000 --> 16:45.000
So this is what it's going to do: basically, run the agent,

16:45.000 --> 16:47.000
Run the acceptance criteria and validate.

16:47.000 --> 16:51.000
If they have passed all the criteria, that's a success overall.

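A scenario like the one she describes, a prompt plus acceptance criteria, could look roughly like this. The field names below are invented for illustration; the talk doesn't show 10K's actual file format:

```json
{
  "name": "create-recipe-server",
  "prompt": "Create an MCP server in Go that serves recipes, using the official Go SDK.",
  "env": { "GOFLAGS": "-mod=mod" },
  "validation": [
    "go build ./...",
    "go vet ./...",
    "go test ./..."
  ]
}
```

The run counts as a success only if every validation command exits cleanly, which gives each experiment a binary, comparable outcome.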
16:51.000 --> 16:55.000
And if you look at the templates, let's get one... this one,

16:55.000 --> 16:57.000
the go doctor MCP server.

16:57.000 --> 17:00.000
So I have alternative one is my control.

17:00.000 --> 17:05.000
I have Gemini CLI, Original, no custom features.

17:05.000 --> 17:08.000
But then I have the go doctor MCP as a second alternative.

17:08.000 --> 17:10.000
And this one has an option

17:10.000 --> 17:12.000
configuring the MCP server.

17:12.000 --> 17:16.000
I can add other stuff here as long as it's configured in my library.

17:16.000 --> 17:18.000
I can add an extension or another server.

17:18.000 --> 17:20.000
Like gopls, for example.

17:20.000 --> 17:21.000
And so on.

17:21.000 --> 17:22.000
So this is the idea.

17:22.000 --> 17:27.000
And whenever I run an experiment, I get this nice UI here.

17:27.000 --> 17:31.000
So for example, this was an experiment

17:31.000 --> 17:35.000
I was comparing go doctor with the default alternative.

17:35.000 --> 17:40.000
And you see, the beauty of statistics is that I can see here that,

17:40.000 --> 17:45.000
oh yeah, the default only was successful 62% of the time.

17:45.000 --> 17:48.000
And go doctor was successful 76% of the time.

17:48.000 --> 17:51.000
And I can say, oh, go doctor is more successful.

17:51.000 --> 17:52.000
No, it's not.

17:52.000 --> 17:57.000
Because it doesn't have statistical significance.

17:57.000 --> 18:02.000
But the average duration has three stars of statistical significance.

18:02.000 --> 18:05.000
This means that it is very unlikely

18:05.000 --> 18:08.000
that this is due to random noise.

18:08.000 --> 18:13.000
So in this case, yeah, it's 30 seconds faster, roughly 30% faster.

18:13.000 --> 18:17.000
Using go doctor, this version of go doctor, which is an old one.

18:17.000 --> 18:20.000
And here I have some determinant analysis, where you can see:

18:20.000 --> 18:27.000
oh, because go doctor uses smart edit, with a significance of roughly 4%,

18:27.000 --> 18:30.000
it's likely faster because of this tool.

18:30.000 --> 18:34.000
So this is the kind of analysis you can do.

18:34.000 --> 18:37.000
Yeah, so I ran tons of experiments.

18:37.000 --> 18:39.000
And what happens?

18:39.000 --> 18:42.000
So yeah, oh.

18:42.000 --> 18:45.000
By the way, this is the architecture of 10K.

18:45.000 --> 18:48.000
I'll get back to the experiments in a moment.

18:48.000 --> 18:51.000
I would like to say that this is the architecture of 10K,

18:51.000 --> 18:56.000
but right now it only works on my machine.

18:56.000 --> 18:59.000
And there's a reason for that.

18:59.000 --> 19:03.000
The front end, this one I just showed, is running on Cloud Run.

19:03.000 --> 19:07.000
That works amazing, and the experiment database in Cloud SQL works amazing.

19:07.000 --> 19:13.000
Everything else is working on Cloud Storage, and go doctor is running as an MCP server on Cloud Run.

19:13.000 --> 19:19.000
But I have run into a nasty bug on the runner part: every time I call go mod init, it crashes.

19:19.000 --> 19:22.000
I don't know why I need to debug this.

19:22.000 --> 19:30.000
It will take a while, but the concept is really nice, because you can have Cloud Run both for services and for jobs.

19:30.000 --> 19:34.000
So the idea is, for each run... say you have a sample size of 50:

19:34.000 --> 19:39.000
I just trigger 50 different containers, and each does its job and collects the data.

19:39.000 --> 19:45.000
So at the moment, if you go to the open source repo, you only find the local version.

19:45.000 --> 19:48.000
This one is a work in progress, but I'm getting there.

19:48.000 --> 19:54.000
Okay, so I showed one of the experiments, which was just showing: oh yeah, go doctor is kind of faster,

19:54.000 --> 19:57.000
but not necessarily more precise.

19:57.000 --> 20:04.000
But the idea is, instead of just guessing, what I do is: I analyze my experiments,

20:04.000 --> 20:11.000
I observe the model while I'm just typing some stuff, and see: oh yeah, this model is struggling with go mod.

20:11.000 --> 20:15.000
What tool can I design to help with this go mod problem?

20:15.000 --> 20:17.000
Then I develop my hypothesis.

20:17.000 --> 20:23.000
I'll create the tool, run the experiment, say, you know, run this 50 times, and walk away.

20:23.000 --> 20:25.000
So I don't need to spend two months doing this.

20:25.000 --> 20:29.000
I'll just run this, then I analyze this experiment.

20:29.000 --> 20:32.000
Did I find anything that is statistically significant?

20:32.000 --> 20:34.000
Did I actually improve the flow?

20:34.000 --> 20:37.000
Yes or no, run it and repeat.

20:37.000 --> 20:39.000
That's the whole point of this framework.

20:39.000 --> 20:42.000
So a few results.

20:42.000 --> 20:48.000
I ended up coming up with a new generation of tools. And remember, selling the thing to the model is important.

20:48.000 --> 20:51.000
So every time I have a new tool, now I'm calling it smart.

20:51.000 --> 20:54.000
Because, oh, this is not just read, this is smart read.

20:54.000 --> 20:57.000
This isn't just edit, this is smart edit and so on.

20:57.000 --> 21:06.000
But you also need to tune the MCP instructions and the description of the tool as well.

21:06.000 --> 21:08.000
Never do a simple operation.

21:08.000 --> 21:11.000
So I'm always combining several operations in the same tool.

21:11.000 --> 21:13.000
I'll show a few of them in a minute.

21:13.000 --> 21:16.000
Return additional context every time you can.

21:16.000 --> 21:25.000
So you end up spending more context window as a tradeoff for quality and speed, which ends up leading to less context use overall.

21:25.000 --> 21:30.000
So you kind of preemptively do more things in one turn.

21:30.000 --> 21:31.000
That's the goal.

21:31.000 --> 21:34.000
On failure, provide hints for self-correction.

21:34.000 --> 21:38.000
Things like, you know: oh, you tried to change this line,

21:38.000 --> 21:42.000
but did you actually mean that line? And you give the hint back to the model.

21:42.000 --> 21:44.000
Or did you try to import this package?

21:44.000 --> 21:47.000
But actually the packages you have available are these ones.

21:47.000 --> 21:50.000
So lots of things like that you can do.

21:50.000 --> 21:53.000
And yeah, and the whole point is reducing the degree of freedom.

21:53.000 --> 21:57.000
So, a few examples of tools. Project init.

21:57.000 --> 22:04.000
This creates a project in a path, initializes folders, downloads all the dependencies,

22:04.000 --> 22:07.000
and when it finishes downloading the dependencies,

22:07.000 --> 22:12.000
returns the go doc automatically to the model.

22:12.000 --> 22:16.000
So I kind of force it: every time you get a new dependency,

22:16.000 --> 22:18.000
I force the documentation on it.

22:18.000 --> 22:21.000
So it has no opportunity for guessing.

22:22.000 --> 22:24.000
Next, smart read.

22:24.000 --> 22:30.000
The idea of smart read was to create a tool that simulates the hover feature in IDEs.

22:30.000 --> 22:33.000
So you get the type information as well.

22:33.000 --> 22:36.000
There is no easy way to give type information to agents;

22:36.000 --> 22:38.000
they treat everything as text.

22:38.000 --> 22:40.000
So the idea was to show the file,

22:40.000 --> 22:43.000
and then have an additional context section at the bottom that says:

22:43.000 --> 22:47.000
oh, by the way, you have a create user here,

22:47.000 --> 22:50.000
and the type User is defined as X.

22:51.000 --> 22:55.000
Again, add more context to reduce the chance of hallucination.

22:55.000 --> 22:58.000
I also added a syntax check with the Go parser, just to make sure:

22:58.000 --> 23:01.000
oh yeah, is this a valid Go file or not.

23:01.000 --> 23:04.000
Smart edit is the one where I added the most things,

23:04.000 --> 23:07.000
because I was playing too much with the replace tool

23:07.000 --> 23:10.000
in Gemini CLI, and it's janky.

23:10.000 --> 23:13.000
So, first thing: because we have gofmt and stuff,

23:13.000 --> 23:16.000
I added normalization of whitespace before matching.

23:16.000 --> 23:19.000
You know, get rid of all the white space.

23:19.000 --> 23:23.000
And match only the content, because if it hallucinates a tab or a space,

23:23.000 --> 23:25.000
then you break everything.

23:25.000 --> 23:31.000
I also added some tolerance for typos with edit distance.

23:31.000 --> 23:36.000
So I use Levenshtein distance, up to a distance of two:

23:36.000 --> 23:40.000
up to two characters can differ.

23:40.000 --> 23:43.000
And this is fine, because it will try to compile; if it breaks anything,

23:43.000 --> 23:44.000
it will fail.
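
A minimal sketch of this matching strategy, whitespace normalization plus a Levenshtein threshold of two; the function names are mine, not the talk's:

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace to a single space, so a
// hallucinated tab or extra space can't break the match.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// levenshtein is the classic edit-distance dynamic program.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// fuzzyMatch reports whether needle matches haystack after whitespace
// normalization, tolerating up to maxDist character edits.
func fuzzyMatch(haystack, needle string, maxDist int) bool {
	return levenshtein(normalize(haystack), normalize(needle)) <= maxDist
}

func main() {
	fmt.Println(fuzzyMatch("if x  != nil {", "if x != nil {", 2)) // true: whitespace only
	fmt.Println(fuzzyMatch("if x != nil {", "if x != nul {", 2))  // true: one typo
	fmt.Println(fuzzyMatch("return nil", "if x != nil {", 2))     // false: too different
}
```

As the talk notes, tolerating typos is safe here because the compile step after the edit catches anything the fuzziness gets wrong.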

23:44.000 --> 23:47.000
And a bunch more things, like, yeah,

23:47.000 --> 23:50.000
automatically running gofmt and goimports,

23:50.000 --> 23:54.000
and preventing saving if you introduced, like, a syntax error,

23:54.000 --> 23:57.000
and adding this hints system to say, oh, by the way,

23:57.000 --> 23:58.000
that didn't work.

23:58.000 --> 24:02.000
I didn't find the thing you are trying to replace,

24:02.000 --> 24:04.000
but the next best match is this one.

24:04.000 --> 24:06.000
Did you mean this one?

24:06.000 --> 24:10.000
It also has some support for line numbers,

24:10.000 --> 24:12.000
and a few other things as well.
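
The gofmt-run-plus-syntax-gate combination described above can be sketched with the standard `go/format` package, which both formats and rejects invalid Go in one call (goimports lives in `golang.org/x/tools`, so this sketch uses gofmt only; `saveGate` is my name, not the talk's):

```go
package main

import (
	"fmt"
	"go/format"
)

// saveGate runs edited source through gofmt and refuses to save when
// the edit introduced a syntax error, returning the parse error as a
// hint instead of silently writing a broken file.
func saveGate(src []byte) ([]byte, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return nil, fmt.Errorf("refusing to save, edit broke the file: %w", err)
	}
	return formatted, nil
}

func main() {
	// Valid but ugly code: the gate formats and accepts it.
	good := []byte("package demo\nfunc  Hello()string{return \"hi\"}\n")
	out, err := saveGate(good)
	fmt.Printf("accepted=%v\n%s", err == nil, out)

	// A dangling function body: the gate rejects it with a hint.
	bad := []byte("package demo\nfunc Hello() string { return \n")
	if _, err := saveGate(bad); err != nil {
		fmt.Println(err)
	}
}
```

The error message from the failed parse is exactly the kind of actionable hint the talk recommends returning to the model.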

24:12.000 --> 24:14.000
And the last one is smart build,

24:14.000 --> 24:19.000
which just simplifies the pipeline.

24:19.000 --> 24:22.000
It does build, test, and lint in the same run,

24:22.000 --> 24:26.000
and does a bit of magic with mod tidying, and so on.
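
A sketch of such a pipeline; the exact steps and their ordering are my guess at what smart build runs:

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildSteps is an assumed smart-build pipeline: tidy modules, then
// build, vet, and test in one tool call.
func buildSteps() [][]string {
	return [][]string{
		{"go", "mod", "tidy"}, // the "bit of magic" with module tidying
		{"go", "build", "./..."},
		{"go", "vet", "./..."},
		{"go", "test", "./..."},
	}
}

// smartBuild runs the pipeline in dir and stops at the first failing
// step, returning that step's output so the model gets one focused
// error to fix instead of a wall of unrelated failures.
func smartBuild(dir string) error {
	for _, s := range buildSteps() {
		cmd := exec.Command(s[0], s[1:]...)
		cmd.Dir = dir
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %v\n%s", s, err, out)
		}
	}
	return nil
}

func main() {
	// Outside a Go module, the first step fails and its output
	// becomes the hint returned to the model.
	if err := smartBuild("."); err != nil {
		fmt.Println("pipeline stopped:", err)
		return
	}
	fmt.Println("all steps passed")
}
```

Collapsing four commands into one call is also an example of the "reduce the model's degrees of freedom" advice from the conclusions.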

24:26.000 --> 24:29.000
So let's see if I can show some of this.

24:29.000 --> 24:32.000
So I have a vanilla example here,

24:32.000 --> 24:35.000
which is this;

24:35.000 --> 24:38.000
the prompt, by the way, is this one.

24:38.000 --> 24:41.000
So create a model context protocol server

24:41.000 --> 24:43.000
that says Hello World, too.

24:43.000 --> 24:45.000
And the way I write prompts is usually like this,

24:45.000 --> 24:50.000
so I have overall context of what I want to do.

24:50.000 --> 24:54.000
But then I break it down into a list of tasks

24:54.000 --> 24:58.000
to make sure everything I want is done.

24:58.000 --> 25:01.000
In this case, I'm already giving a hint to the model,

25:01.000 --> 25:04.000
which is to inspect the documentation of the SDK,

25:04.000 --> 25:06.000
which most models,

25:06.000 --> 25:08.000
most modern models will comply,

25:08.000 --> 25:11.000
but the 2.5 generation of Gemini didn't,

25:11.000 --> 25:13.000
all the time.

25:13.000 --> 25:16.000
And the idea of having acceptance criteria

25:16.000 --> 25:19.000
is also super important, so the model can tell you,

25:19.000 --> 25:21.000
I'm actually done with the work.

25:21.000 --> 25:23.000
So let's see.

25:23.000 --> 25:26.000
It's running, so it's already in the lint stage,

25:26.000 --> 25:29.000
so it's running our check here, you'll see.

25:29.000 --> 25:33.000
So yeah, it's doing kind of well,

25:33.000 --> 25:38.000
actually, it's using go list to discover the package.

25:38.000 --> 25:40.000
It's gray, it's hard to see.

25:40.000 --> 25:44.000
But it kind of seems to be behaving well enough,

25:44.000 --> 25:47.000
and just to show the difference with godoctor.

25:47.000 --> 25:52.000
So G3 is the same prompt, but this time,

25:52.000 --> 25:54.000
yeah.

25:54.000 --> 25:57.000
This time it has godoctor enabled.

25:57.000 --> 26:01.000
So hopefully this will call the godoctor MCP server,

26:01.000 --> 26:03.000
so it's calling project init.

26:03.000 --> 26:05.000
You can see, for example, here,

26:05.000 --> 26:07.000
by calling project init,

26:07.000 --> 26:09.000
it not only created the module,

26:09.000 --> 26:12.000
but also fed the documentation right away,

26:12.000 --> 26:15.000
so it doesn't have the opportunity to guess anything.

26:15.000 --> 26:16.000
File created is also,

26:16.000 --> 26:19.000
it's the only tool I'm not calling smart create for now,

26:19.000 --> 26:23.000
but it also has some magic happening under the hood.

26:23.000 --> 26:26.000
And yeah, in essence,

26:26.000 --> 26:30.000
just running those two side by side,

26:30.000 --> 26:32.000
yeah, this one is done.

26:32.000 --> 26:35.000
It seems to be working because, yeah,

26:35.000 --> 26:37.000
it's using Gemini.

26:37.000 --> 26:39.000
So you won't see much difference in accuracy,

26:39.000 --> 26:42.000
but you might see differences in time,

26:42.000 --> 26:45.000
which is, again, hard to measure if it's one,

26:45.000 --> 26:48.000
run against the other,

26:48.000 --> 26:53.000
but you can measure if you run this 50 times.

26:53.000 --> 26:57.000
And this is my first try experimenting,

26:57.000 --> 26:59.000
and here I'm also comparing with gopls,

26:59.000 --> 27:01.000
which is the official MCP server we have.

27:01.000 --> 27:03.000
It's just an experiment now,

27:03.000 --> 27:07.000
but you can see the difference in success rate

27:07.000 --> 27:09.000
in this case was abysmal;

27:09.000 --> 27:11.000
you have like a 14% success rate

27:11.000 --> 27:15.000
versus 48%, and this is statistically significant.

27:15.000 --> 27:20.000
Three stars, less than 1% chance of this being a random event.

27:20.000 --> 27:23.000
But why was this so significant?

27:23.000 --> 27:26.000
I was running Gemini CLI with the auto model,

27:27.000 --> 27:30.000
and the auto model, if you don't know Gemini CLI,

27:30.000 --> 27:32.000
is like a mixed bag of models;

27:32.000 --> 27:34.000
it will randomly decide which model to use,

27:34.000 --> 27:36.000
which includes Gemini 2.5.

27:36.000 --> 27:40.000
So this was a big problem with this experiment.

27:40.000 --> 27:41.000
So then,

27:41.000 --> 27:44.000
I decided to pin it to Gemini 3 Flash Preview,

27:44.000 --> 27:46.000
and I got this result.

27:46.000 --> 27:48.000
In this case, I'm adjusting for success rate

27:48.000 --> 27:51.000
to make sure I only count the successful

27:51.000 --> 27:55.000
runs, but you will see the default failed 12 times.

27:55.000 --> 27:59.000
gopls failed 10 times, and godoctor failed five times.

27:59.000 --> 28:01.000
Still, we have a conservative indication

28:01.000 --> 28:03.000
that godoctor is better,

28:03.000 --> 28:05.000
but it's not statistically significant.

28:05.000 --> 28:07.000
I didn't have any stars there.

28:07.000 --> 28:09.000
That means either I increase the sample size

28:09.000 --> 28:12.000
and run this 200 times, maybe, to see some difference.

28:12.000 --> 28:15.000
But I consider them equivalent at this point,

28:15.000 --> 28:18.000
because the intelligence of the model compensates for

28:18.000 --> 28:19.000
the lack of tooling.

28:19.000 --> 28:21.000
So the model in this case was much better.

28:21.000 --> 28:24.000
But what was significant was the duration.

28:24.000 --> 28:28.000
So roughly 30% faster, like, yeah.

28:28.000 --> 28:32.000
Instead of 200 seconds, less than 140 seconds,

28:32.000 --> 28:35.000
so go doctor ended up being much faster.

28:35.000 --> 28:38.000
And although it used more input tokens,

28:38.000 --> 28:40.000
it used fewer output tokens.

28:40.000 --> 28:42.000
Yeah.

28:42.000 --> 28:46.000
We have some other differences in the internal protocols.

28:46.000 --> 28:50.000
Big asterisk here is that this is a single scenario.

28:50.000 --> 28:54.000
It's just the create-an-MCP-server scenario.

28:54.000 --> 28:56.000
Why just this scenario specifically?

28:56.000 --> 28:59.000
Because the model is not familiar with the MCP package,

28:59.000 --> 29:03.000
and I wanted some greenfield where it had to discover the API.

29:03.000 --> 29:05.000
It could be any other project like it

29:05.000 --> 29:08.000
with an unknown API; that would work.

29:08.000 --> 29:11.000
Future work is to create more realistic scenarios

29:11.000 --> 29:14.000
for refactoring, testing, and so on,

29:14.000 --> 29:17.000
so I can do proper testing on those.

29:17.000 --> 29:21.000
And, yeah, we're reaching the end.

29:21.000 --> 29:23.000
Conclusions.

29:23.000 --> 29:26.000
When you write an MCP server, try not to do

29:26.000 --> 29:29.000
a one-to-one mapping of your API, as is often the case;

29:29.000 --> 29:33.000
think from the model's perspective about what is useful.

29:33.000 --> 29:36.000
Documentation and examples are gold.

29:36.000 --> 29:39.000
Every time I have the opportunity to insert that in my context,

29:39.000 --> 29:41.000
I will do it.

29:41.000 --> 29:43.000
Equip your tools with hints upon failures.

29:43.000 --> 29:46.000
So instead of just saying, I failed, say, I failed because of X;

29:46.000 --> 29:49.000
what about trying Y? It's a better approach.

29:49.000 --> 29:52.000
And try to reduce the model's degrees of freedom,

29:52.000 --> 29:55.000
for example, combining build and test.

29:55.000 --> 29:58.000
If you want to try building your own MCP server,

29:58.000 --> 30:02.000
this takes you to a codelab, and you can

30:02.000 --> 30:04.000
win some cloud credits when you do this,

30:04.000 --> 30:06.000
so you don't need to pay anything.

30:06.000 --> 30:08.000
And yeah, this is the same process used to create

30:08.000 --> 30:10.000
godoctor itself.

30:10.000 --> 30:12.000
And that's it.

30:12.000 --> 30:14.000
If you want to connect with me on socials

30:14.000 --> 30:17.000
and the talk materials, they are all there.

30:17.000 --> 30:19.000
Thank you!

30:19.000 --> 30:20.000
Thank you!

