WEBVTT

00:00.000 --> 00:16.200
Hello, I'm going to be talking about how you can use AI, I think in a very nice use case,

00:16.200 --> 00:27.440
to automate production fixes using open source tools and AI models, and so hopefully you'll

00:27.440 --> 00:32.600
get an idea. I'm going to do demos, so you can see how you can do it today, all with open

00:32.600 --> 00:33.600
source.

00:33.600 --> 00:40.960
I am Carlos, I'm a principal scientist working on Adobe Experience Manager, which is a content management

00:40.960 --> 00:46.800
system, and I've been contributing to open source for a long time, and people made me feel

00:46.800 --> 00:52.960
very old this weekend, there are a lot of young kids here. And I'm also part of the Google

00:52.960 --> 01:00.340
Developer Experts program. So before we start, who here has heard about Argo in general,

01:00.340 --> 01:11.760
Argo Rollouts? Some people, okay. Well, I'm not going to ask about Kubernetes. And who

01:11.760 --> 01:21.560
is doing canary deployments or some sort of progressive delivery? Just a few people. And who believes AI is

01:21.560 --> 01:31.440
going to make our lives easier? Okay. And who believes AI is a fad and this is going to die?

01:31.440 --> 01:41.200
Okay, I think you're going to be in trouble, but that's my opinion, we'll see.

01:41.200 --> 01:50.120
so progressive delivery, you have probably heard this term today or before, and this

01:50.120 --> 01:57.400
is just a name that groups all these deployment strategies that try to do progressive

01:57.400 --> 02:04.880
changes to your production, so you don't break everything at once. New

02:04.880 --> 02:14.280
versions get deployed, but they don't replace the existing versions immediately; it just progressively

02:14.280 --> 02:20.360
changes your infrastructure, your application, something like that. And the interesting

02:20.360 --> 02:27.800
thing here is that you're evaluating with live production traffic, so you

02:27.800 --> 02:33.680
don't rely only on unit tests and integration tests, but on actual production data to decide

02:33.720 --> 02:40.680
whether your rollout is being successful or not. I'm old enough to remember when CrowdStrike

02:40.680 --> 02:46.240
broke half of the internet. We weren't on Windows, so it only affected the people with Windows,

02:46.240 --> 02:52.640
so we don't care, right? So, one of the interesting things, and I recommend everybody

02:52.640 --> 02:59.600
to read the root cause analysis of all these big outages, like Amazon, Microsoft, whatever,

02:59.840 --> 03:07.200
it's very interesting. I think it was here, yes, here it says: because

03:07.200 --> 03:11.800
they deployed something and it broke everybody at once, the new model is we're going to do

03:11.800 --> 03:16.840
rings, and every ring is going to be deployed, and then we're going to check if those

03:16.840 --> 03:21.280
rings are going fine and so on, so basically they're going to switch to a canary deployment

03:21.280 --> 03:27.880
of some configuration scripts that before they were deploying all at once.

03:27.880 --> 03:33.560
People are getting a lot of interest in this, because they realize that otherwise

03:33.560 --> 03:39.920
they're breaking the internet. I'm going to skip this slide, actually, because you

03:39.920 --> 03:48.200
already know all this stuff, and jump directly to Argo Rollouts. Argo Rollouts provides

03:48.200 --> 03:53.880
these advanced deployment techniques, and it makes it really easy to use them on Kubernetes,

03:53.880 --> 03:59.080
so you can do blue-green, canary, canary analysis, and experimentation. With experimentation,

03:59.080 --> 04:04.960
it is very cool that you can say, okay, I want to run this for two hours,

04:04.960 --> 04:12.720
five hours, in parallel, and then I can see if this works or doesn't work,

04:12.720 --> 04:18.400
and then I can kill it afterwards. We use it, or we're looking at

04:18.400 --> 04:25.400
using it for Java upgrades: okay, we want to run two pods, or 10 pods, in production,

04:25.400 --> 04:31.800
and with one experiment we're saying, okay, let's run this with Java 21 or 25 and

04:31.800 --> 04:36.520
see if it works, and we just need to run it for a few hours to make sure, and then look

04:36.520 --> 04:43.720
at the logs and see if there were any issues there. The way Argo Rollouts works is, you have

04:43.720 --> 04:52.120
the rollout controller and some objects in Kubernetes. The way you get

04:52.120 --> 04:56.280
traffic is through the ingress and the service, as usual with Kubernetes, and you're

04:56.280 --> 05:00.680
going to have two ReplicaSets on Kubernetes: one is the stable, whatever you're running

05:00.680 --> 05:08.160
today, and one is the canary, which is basically your new version of the image or configuration,

05:08.160 --> 05:13.520
and then the AnalysisRun object is an instantiation of the AnalysisTemplate. The analysis

05:13.520 --> 05:21.280
template is what says what a good analysis is, what a good canary is. The typical

05:21.280 --> 05:30.080
case is: go to Prometheus, get this metric, and if the 500 errors are over 5% or 1% or whatever,

05:30.080 --> 05:37.400
it's a bad canary. So the AnalysisRun checks that, and that

05:37.400 --> 05:42.840
is going to decide whether the canary keeps moving forward or is rolled back.

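To make that concrete, here is a rough sketch of a Rollout with progressive traffic steps plus an AnalysisTemplate that checks a 500-error rate in Prometheus. The resource names, the Prometheus address, the metric names, and the 5% threshold are all illustrative assumptions, not taken from the talk:

```yaml
# Hypothetical Rollout: shift traffic to the canary in steps, running
# the background analysis defined below while the rollout progresses.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app                       # illustrative name
spec:
  replicas: 5
  # (selector and pod template omitted for brevity)
  strategy:
    canary:
      canaryService: demo-app-canary   # service pointing at the canary ReplicaSet
      stableService: demo-app-stable   # service pointing at the stable ReplicaSet
      analysis:
        templates:
          - templateName: error-rate-check
      steps:
        - setWeight: 1                 # 1% of traffic to the canary
        - pause: {duration: 5m}
        - setWeight: 5
        - pause: {duration: 5m}
        - setWeight: 20
        - pause: {duration: 10m}
---
# Hypothetical AnalysisTemplate: ask Prometheus for the HTTP 500 rate;
# the canary fails if more than 5% of requests return a 5xx error.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: http-500-rate
      interval: 1m
      failureLimit: 3
      successCondition: result[0] < 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # illustrative address
          query: |
            sum(rate(http_requests_total{app="demo-app",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="demo-app"}[5m]))
```
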
05:42.840 --> 05:51.760
And here you can decide, I want to do 1%, 5% and 20%; whatever

05:51.760 --> 05:57.400
sort of analysis you want to do, you define it in an analysis template, and you have a bunch

05:57.400 --> 06:03.760
of plugins, metric plugins, Prometheus, or you can run just a Kubernetes job that will

06:03.760 --> 06:09.000
decide whether your canary is good or bad, and the one I added here is

06:09.000 --> 06:17.640
one to do AI analysis with an LLM. So the demo I'm going

06:17.640 --> 06:21.960
to show you is, you have the source code, you deploy, you do the canary, you do the canary

06:21.960 --> 06:28.840
analysis, now an LLM is going to decide whether this canary is good, and then it's going

06:28.840 --> 06:35.920
to promote it as usual, or if it's bad, it's going to roll it back. So, Argo

06:35.920 --> 06:42.760
Rollouts talks to this plugin, and the plugin calls an agent using

06:42.760 --> 06:48.120
A2A, that will look at your Kubernetes cluster with whatever tools you want to create. I have

06:48.120 --> 06:54.760
an example agent that does list pods, get pods, get logs, get metrics, get events,

06:54.760 --> 07:00.680
and so on, and that gets fed into the LLM, and then the LLM will tell you

07:00.680 --> 07:11.160
exactly what the problem is. Now bear with me for a second, because what happens

07:11.160 --> 07:19.280
after this? I don't want to have to go and look at what the problem is, so what I'm doing

07:19.280 --> 07:27.920
is, besides the rollback of the canary, my plugin will create a GitHub issue, it

07:27.920 --> 07:35.040
will assign an agent, an AI coding agent, that will generate the

07:35.040 --> 07:42.520
pull request for me, that I only have to approve, or I can just say,

07:42.520 --> 07:47.840
I mean, this will take a bit more work, but you can say, automatically merge it, and

07:47.840 --> 08:00.720
then you get the full loop, completely automated, and this works, 100% of the time. So, that's

08:00.720 --> 08:08.640
what we're going to end up doing. So, we have Argo running here, I put this at the top,

08:08.640 --> 08:14.400
because otherwise from the back you probably cannot see it, okay. So, I have

08:14.440 --> 08:24.960
my application showing green, it's responding with the color green, right?

08:24.960 --> 08:30.960
I only have one pod here for the sake of the demo, but if I

08:30.960 --> 08:39.200
say blue, if I do an upgrade to blue, I'm running a new canary with a new pod that is going

08:39.280 --> 08:46.800
to be blue, and here we should see some blue dots, which means

08:46.800 --> 08:54.240
the traffic is going to my new version. And the canary is going to do the analysis, and

08:54.240 --> 09:04.400
is going to decide whether this new blue implementation is going to be fine or not, and let's

09:04.400 --> 09:17.680
see why it's not getting anything yet. So, the analysis is running here, so we should be getting

09:17.680 --> 09:23.520
some blue. Why is it not getting any blue? Obviously, demos never work.

09:24.480 --> 09:42.160
That is blue if I run it directly. Okay, something is wrong, because I'm not getting any blue.

09:42.240 --> 09:58.160
Yeah. All right. Let's see, nothing like debugging live.

10:00.560 --> 10:03.360
The other pod is running and getting traffic.

10:06.960 --> 10:10.560
Yeah. Let's go back to green.

10:12.240 --> 10:30.560
I have a preview here as well, that is also not working. So, basically, the two services are

10:30.960 --> 10:35.040
not reachable for some reason. Let's see, okay.

10:39.200 --> 10:40.880
The canary is not working.

10:45.120 --> 10:48.240
Why is this not working? There it is.

10:50.560 --> 10:56.000
For some reason. So, now it's the other way around: the stable is the blue, the canary is the green,

10:56.080 --> 11:00.080
and eventually it's all going to be green if nothing breaks again.

11:02.800 --> 11:12.160
And we'll take a look at the analysis. The AnalysisRun, I'll make it a bit bigger.

11:15.200 --> 11:19.200
Yeah, it's the number of requests that are blue and green. So, now it's all green.

11:19.200 --> 11:28.000
So, it's all green and this is a successful analysis. I can go and look at the previous one,

11:28.000 --> 11:39.040
two minutes ago. Okay, I can describe the object. And on the AnalysisRun, I have the Gemini

11:39.040 --> 11:46.080
2.5 Flash analysis. It tells me, basically, my instructions are: look at this, tell

11:46.080 --> 11:53.840
me if I should promote it or not, and how confident you are, and it gives me the reasoning, I think.

11:53.840 --> 11:58.400
So, I'm going to read it here. Confidence 95%, promote: true. Remediation:

11:58.400 --> 12:02.400
no remediation steps are required as the canary deployment appears healthy.

12:02.400 --> 12:06.880
Root cause for failure: none, the canary deployment is operating as expected.

12:06.880 --> 12:11.280
I have a multi-model analysis here, because I'm also running Gemma.

12:11.600 --> 12:17.600
So, I'm running this against Gemini and Gemma. Gemma is running on a GPU inside this Google

12:17.600 --> 12:25.280
Kubernetes cluster, and that's not working that well yet, but it's doing both at the same time.

12:26.160 --> 12:32.640
So, Gemini is telling me: the stable pods show green responses, the canary pods show blue responses,

12:33.600 --> 12:37.600
as per the instructions. In the analysis template you can also

12:37.600 --> 12:42.160
give extra instructions, and my instruction is: ignore color changes, because sometimes

12:42.160 --> 12:48.320
it's like, oh, it's red, red looks bad, I will roll back. So, as per the instructions,

12:48.320 --> 12:51.760
color changes are to be ignored, blah, blah, blah. So, everything is fine.

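The exact configuration schema of the speaker's LLM metric plugin isn't shown in the talk, so treat this as a hypothetical sketch of how such a plugin metric with extra instructions might be wired into an AnalysisTemplate; the plugin name and every field under it are assumptions, check the plugin's own documentation for the real schema:

```yaml
# Hypothetical AnalysisTemplate using a metric plugin that delegates
# the verdict to an LLM. All plugin field names here are guesses.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ai-canary-analysis
spec:
  metrics:
    - name: llm-verdict
      provider:
        plugin:
          argoproj-labs/ai:              # hypothetical plugin name
            model: gemini-2.5-flash
            # Extra instructions, as mentioned in the talk: without them
            # the model may treat a harmless color change as a failure.
            instructions: |
              Ignore color changes in responses; they are expected.
              Only fail the canary for errors, panics, or crash loops.
```
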
12:53.760 --> 13:02.560
And now I'm going to do a bad one. We'll see, hopefully, how this works.

13:02.960 --> 13:09.960
So, this new version is going to respond with random colors. Yes?

13:11.960 --> 13:19.960
Can I ask a question: why doesn't your system show the problems that you actually solved yourself?

13:19.960 --> 13:26.960
I didn't solve anything, it just started working here. It didn't solve it, I'll tell you now.

13:27.520 --> 13:35.040
So now I'm getting random colors, which is what I wanted, and it's going to do the analysis,

13:35.040 --> 13:39.920
and the analysis has already failed, it's degraded, and it's going to roll it back automatically, this is what

13:39.920 --> 13:47.200
Argo does. Now, why didn't the previous thing work? It was an issue with the routing.

13:47.200 --> 13:53.040
It was not an issue with the new pod working or not working; there was something in the routing

13:53.440 --> 13:59.120
that for some reason failed, and after I deployed again it started working, I don't know what happened.

13:59.120 --> 14:04.560
It never happened before, trust me. Other things have happened before, but that one, no.

14:07.840 --> 14:12.880
Yeah, the way Argo Rollouts works is you have old pods, you have new pods,

14:12.880 --> 14:18.320
and you can look at anything you want there, but it's not going out to check if your routing

14:18.320 --> 14:23.920
is working, it's not going out through the internet or anything like that. It's just old versions,

14:23.920 --> 14:30.560
new versions, and you compare. It's not only that you can check the new version, you can

14:30.560 --> 14:36.880
also compare both versions. For instance, you could say

14:36.880 --> 14:44.560
500 errors are bad, but if the old version also had 500 errors, what do you do? So what we are then

14:44.560 --> 14:52.160
doing is: I don't care if I have 500 errors as long as they are not 5% more than in the previous version.

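That relative check could be sketched like this, an illustrative AnalysisTemplate whose Prometheus query subtracts the stable error rate from the canary error rate, so the canary only fails if it errors noticeably more than the baseline; the metric name, the role labels, and the address are assumptions:

```yaml
# Hypothetical baseline-vs-canary check: instead of an absolute threshold,
# compare the canary's 500 rate against the stable ReplicaSet's.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: relative-error-rate
spec:
  metrics:
    - name: canary-vs-stable-500s
      interval: 1m
      # Pass while the canary's error rate is within 5 points of stable's.
      successCondition: result[0] <= 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # illustrative
          query: |
            (sum(rate(http_requests_total{app="demo-app",role="canary",status=~"5.."}[5m]))
             / sum(rate(http_requests_total{app="demo-app",role="canary"}[5m])))
            -
            (sum(rate(http_requests_total{app="demo-app",role="stable",status=~"5.."}[5m]))
             / sum(rate(http_requests_total{app="demo-app",role="stable"}[5m])))
```
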
14:52.160 --> 14:58.160
You can do these things playing with both versions at the same time. So you saw this thing failed,

14:58.160 --> 15:07.920
it was rolled back, the analysis failed. Yes. Now, I'm going here to my repo,

15:07.920 --> 15:16.480
and one minute ago my system created an issue. The analysis says the stable pod is consistently

15:16.480 --> 15:23.760
serving requests with green responses, healthy, blah blah blah. The canary pod is frequently

15:23.760 --> 15:30.400
experiencing panics, serving a significant amount of errors. So it went autonomously to check the logs,

15:30.400 --> 15:36.880
describe the pods, whatever, I don't care. So these panics are critical application errors

15:36.960 --> 15:43.360
and not merely aesthetic differences. What do I do with this issue? This issue is already very good.

15:43.360 --> 15:50.240
Like I can go here and tell somebody, hey, the canary broke, here is why it broke.

15:51.440 --> 15:59.120
But the next step is I tell my AI plugin when it creates the issue, it labels it with

15:59.440 --> 16:07.920
Jules. Jules is a Google bot for writing code. You can use Copilot, you can use

16:09.280 --> 16:16.800
whatever the other one is called. Basically, once you have an issue, it's very easy to

16:16.800 --> 16:25.520
have one of these coding agents just work on it. So Jules is on it, and Jules is doing

16:25.920 --> 16:35.440
its magic here. It has the instructions, it gets the logs, it has the

16:35.440 --> 16:43.040
logs, it's working somewhere, and it's figuring out what to do, and it is, I think it's already,

16:43.040 --> 16:52.320
yeah, it ran the full test suite, the tests pass, logs, everything, right? And it should,

16:52.960 --> 17:01.440
so this was two minutes ago, in two minutes I have a PR, and let's go to my pull requests.

17:04.320 --> 17:09.680
Well, it's not there yet, but the previous one I can show you, it's going to show up in a bit,

17:11.520 --> 17:18.720
and so it created this commit, "removes the intentional panic", so it figured out

17:18.720 --> 17:26.000
that it was intentionally put in, because, like, yeah, it's not that dumb: you intentionally

17:26.000 --> 17:32.800
put this thing in there, and it just removed the whole thing. I put it here,

17:32.800 --> 17:39.120
just an array access that panics, because I'm accessing an index that doesn't exist,

17:39.120 --> 17:47.600
anyway, it should remove everything, right? And it's very cool, like,

17:47.920 --> 17:54.960
Jules is reporting for duty, and you can talk to it, and then I also have Gemini Code Assist,

17:54.960 --> 18:00.400
another one, to review the PR that Jules wrote. So, it's telling me, yes,

18:01.040 --> 18:09.520
it resolves a critical issue, eliminating a deliberately introduced panic, and cleans up the code, and you see,

18:09.520 --> 18:20.560
it approves. So, now, if I go back here, is it there yet? Not yet, okay, so it's still working there somewhere.

18:22.560 --> 18:31.520
All the steps completed, ready for submission. So, in a few minutes, I have a PR to fix my issue.

18:31.520 --> 18:43.760
I can just merge it, and this whole process starts again. And, okay, let's see, this one,

18:45.600 --> 18:53.280
is it here? There it is, open now. And again, every time you get a different description

18:53.280 --> 18:58.000
or whatever; this one says removing the intentional panic, the index out of range,

18:58.000 --> 19:03.680
blah, blah, blah. Let's see what changed; sometimes it makes different changes,

19:03.680 --> 19:10.960
this time it made the same changes as before, and that's it, I don't have to do anything.

19:12.240 --> 19:22.480
Just to wrap it up: rolling out changes to all the users at once is risky,

19:22.480 --> 19:30.240
as you've seen, things happen. Canary rollouts and feature flags, I didn't even

19:30.240 --> 19:34.640
go through feature flags, but canary rollouts, feature flags, all these features allow you to do this

19:34.640 --> 19:42.880
in a safer way, and you can use AI to automate all these things. So, I, for one, welcome

19:42.960 --> 19:52.800
our new robot overlords. And you have all the source code: Argo Rollouts, the metric plugin

19:52.800 --> 20:00.080
that uses LLMs to do the analysis, the rollouts demo, that's this demo, and the Kubernetes

20:00.080 --> 20:07.840
agent, and you can write your own agent. But let me also show you, I was going to show you

20:07.920 --> 20:25.200
when something good happens, I think it was here. So, with Gemini, let's see, I had the logs here,

20:28.880 --> 20:35.600
it goes and tries to run a bunch of tools, so you can see how the LLM thinks. It goes:

20:35.600 --> 20:43.760
I exec into these pods, get logs, get events, describe the bad pod, get logs again,

20:43.760 --> 20:49.120
get events again, and so on. So, you see how it's trying to figure things out, because basically

20:49.120 --> 20:55.600
you can write more complex prompts, giving it more information, but I didn't bother

20:55.600 --> 21:00.160
too much; it just keeps going: okay, I'm going to check the logs, check the events, check

21:00.160 --> 21:14.000
this, check that, and it's going to do it by itself. Okay, so, any questions? Yes?

21:14.000 --> 21:29.200
What do you think, is it ready for production? No, I mean, there's nothing preventing

21:29.200 --> 21:33.840
you from running this, because the worst thing that's going to happen is it's not going to roll out

21:33.840 --> 21:40.800
something. But this is fairly new, I wrote it in September,

21:40.800 --> 21:49.360
and I think this is a good way to get into it, because you don't

21:49.360 --> 21:54.800
need to take any risk, the worst thing that can happen is that either you promote something that

21:54.800 --> 22:01.200
is bad, sure, that will be bad, or roll back something that is good, that's not a big problem, yeah?

22:01.200 --> 22:15.120
Yes? Question: how do you decide what metrics to provide and what metrics

22:15.120 --> 22:24.720
not to provide to the agent? What I mean is, for example, as you told before, you just read the logs

22:24.720 --> 22:33.360
of the pods, but can you read the logs of other elements and see the possible

22:33.360 --> 22:37.760
issues? Okay, so the question is: how do I define what metrics to provide and what not?

22:37.760 --> 22:44.480
With the agentic part, I don't provide that. You can use Argo Rollouts without AI, and you can

22:44.480 --> 22:49.920
say, go to Prometheus, get the metrics and whatnot. In my example agent, I don't

22:49.920 --> 22:54.880
do that, I only do list pods, get pods, get events and so on, but you could have a tool for your

22:54.880 --> 23:00.560
LLM that says get metrics, but then you have to define what get metrics means:

23:00.560 --> 23:04.560
it means going to this Prometheus server to get the metrics, so you would have

23:04.560 --> 23:11.920
to define how the LLM can go and fetch metrics. The LLM will then tell you, I want to call

23:11.920 --> 23:18.000
this tool, get metrics, and your agent is going to make those calls and give the

23:18.080 --> 23:23.040
results to the LLM, so you could also have metrics embedded into the LLM reasoning.

23:32.320 --> 23:38.320
You can connect as many data sources as you want; there's no limit other than the reasoning

23:38.320 --> 23:44.160
of the LLM. If you give too much information to one agent, at that point you would go to

23:44.160 --> 23:48.880
a multi-agent system where you would have multiple agents collaborating and talking to each other,

23:48.880 --> 23:53.680
so maybe you have an agent for metrics, an agent for Kubernetes, an

23:53.680 --> 23:59.440
agent for something else, and they would talk to each other to get data from one another.

24:02.320 --> 24:03.760
Okay, thank you.

