WEBVTT

00:30.000 --> 00:59.000
Yeah, it's 4 p.m.

00:59.000 --> 01:06.000
So let's start with the next talk.

01:06.000 --> 01:24.000
Let's go.

01:24.000 --> 01:33.000
Awesome, thank you. For the last four years,

01:33.000 --> 01:39.000
I have been responsible for running our PostgreSQL infrastructure, which supports GitLab.com.

01:39.000 --> 01:46.000
For the eight years before that, I worked at an open source, and especially PostgreSQL, consulting company.

01:46.000 --> 01:54.000
Today, I would like to talk to you about how we execute PostgreSQL and operating system upgrades at GitLab with zero downtime.

01:54.000 --> 01:58.000
And I would like to do that by answering a few questions.

01:58.000 --> 02:00.000
We start with PostgreSQL upgrades.

02:00.000 --> 02:02.000
How do they work and why are they hard?

02:02.000 --> 02:10.000
We talk about OS upgrades, what's the deal there, and then I will show you how we minimized the impact for our users.

02:10.000 --> 02:16.000
First of all, PostgreSQL upgrades, and by that I mean major upgrades.

02:16.000 --> 02:17.000
Why are they hard?

02:17.000 --> 02:24.000
If you have a stateless service like an application server, you might get away with just changing the running process.

02:24.000 --> 02:30.000
For example, with PgBouncer, a pooling agent, you can just switch out the running process without any interference.

02:30.000 --> 02:36.000
Or if you have basically any application server, you just start up new application servers, fade out all the old ones,

02:36.000 --> 02:40.000
and you can easily upgrade without having any impact on your users.

02:40.000 --> 02:44.000
But if you have a service that comes with a state, it's much, much harder,

02:44.000 --> 02:49.000
and especially for complex relational databases like PostgreSQL,

02:49.000 --> 02:54.000
when you do a major upgrade, you don't just change the process, not just the code that is running,

02:54.000 --> 02:58.000
you might need to change the data that is stored.

02:58.000 --> 03:04.000
And if you have a large database, you might have a lot of data to change, or at least to go through.

03:04.000 --> 03:10.000
And with PostgreSQL major upgrades, it often happens that the control structures, the structures

03:10.000 --> 03:13.000
that define how the data is stored on disk, get optimized and changed.

03:13.000 --> 03:19.000
So you have to rewrite your state, and that's not something that happens instantly.

03:19.000 --> 03:28.000
So, there's the recommended default method to upgrade a PostgreSQL database,

03:28.000 --> 03:32.000
which is still the default, and if that works for you, you should definitely do it.

03:32.000 --> 03:34.000
You start out with your database.

03:34.000 --> 03:39.000
Your data is stored in a binary format on disk that might depend on the libraries you use on your system.

03:39.000 --> 03:44.000
It depends on your architecture, so the data will look different on an x86 machine than on a RISC machine.

03:44.000 --> 03:51.000
For example, the endianness differs, so this data cannot be easily transferred to a completely different architecture,

03:51.000 --> 03:54.000
and you cannot use it with a different major version of PostgreSQL.

03:54.000 --> 04:01.000
So the default method is you export this data into a logical format that is independent of your architecture and your version.

04:01.000 --> 04:07.000
For example, SQL will do the trick, but there's also an internal format that is slightly more optimized.

04:07.000 --> 04:13.000
So you start out by exporting your data, you go from the binary stored data to a logical representation,

04:13.000 --> 04:22.000
and then on your new server, you just import the data and it gets transformed again into binary data on your system,

04:22.000 --> 04:25.000
and afterwards you have to recreate all helping structures.

04:25.000 --> 04:30.000
That's mostly indexes, but also statistics and so on.
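
To make that concrete, here is a minimal sketch of the dump/restore route; the database names are placeholders, and the shell commands are shown only as comments alongside the one SQL step:

```sql
-- Shell (not SQL), a sketch with placeholder names:
--   pg_dump --format=custom --file=mydb.dump mydb     # logical export
--   pg_restore --jobs=8 --dbname=mydb_new mydb.dump   # parallel import
-- The restore rebuilds the indexes; the planner statistics you recreate yourself:
ANALYZE;
```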

04:30.000 --> 04:36.000
This is the safest method because you can be sure that all the data is valid,

04:36.000 --> 04:39.000
because it was exported once and then parsed again.

04:39.000 --> 04:45.000
For many data types, you have special validation; for example, if you have JSON stored in your database,

04:45.000 --> 04:50.000
it gets parsed again, so if anything is fishy there, you will notice that.

04:50.000 --> 04:56.000
And also all your helping structures are fresh; you get squeaky-fresh indexes, no bloat.

04:56.000 --> 05:01.000
But the problem is, if you have a large database, this takes quite some time.

05:01.000 --> 05:08.000
Our main database currently has like 40 terabytes of data, and for us this operation would take multiple days.

05:08.000 --> 05:13.000
If it works for your database, please do that.

05:13.000 --> 05:19.000
You can listen to the rest of my talk, but if that works, please do it.

05:19.000 --> 05:28.000
The next method is still fairly safe, but you have to think about a few things, and that's pg_upgrade.

05:28.000 --> 05:35.000
pg_upgrade is a tool that gets released with every PostgreSQL release, and it knows the current data structure,

05:35.000 --> 05:40.000
and it knows all previous data structures, so it will know which parts have to change.

05:40.000 --> 05:45.000
It goes through all your heap data, or your binary data on disk, and changes all the parts that need to be changed.

05:49.000 --> 05:53.000
Quite simple; it's reasonably fast and reasonably safe.

05:53.000 --> 05:58.000
You have to think of a few things; for example, the helping structures like indexes will not be updated.

05:58.000 --> 06:04.000
So if the new PostgreSQL version comes with an optimized version of B-tree indexes, and you want to profit from it,

06:04.000 --> 06:08.000
you have to recreate the indexes yourself.

06:08.000 --> 06:16.000
Or if you would like to go to a new operating system or different architecture, you cannot use this method safely, so keep that in mind.

06:16.000 --> 06:25.000
But still, if this fulfills your need, awesome, use this method, and don't look deeper into the next thing.
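
As a rough sketch (paths and flags are illustrative, not our exact invocation), a pg_upgrade run plus the manual index rebuild mentioned above could look like this:

```sql
-- Shell (not SQL), illustrative paths for a Debian/Ubuntu layout:
--   pg_upgrade \
--     -b /usr/lib/postgresql/16/bin -B /usr/lib/postgresql/17/bin \
--     -d /var/lib/postgresql/16/main -D /var/lib/postgresql/17/main \
--     --link
-- pg_upgrade keeps the old on-disk index format; rebuild indexes to benefit
-- from newer B-tree improvements (PostgreSQL 12+ supports CONCURRENTLY):
REINDEX DATABASE CONCURRENTLY mydb;  -- 'mydb' is a placeholder name
```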

06:25.000 --> 06:32.000
So why can't we use these methods for GitLab?

06:32.000 --> 06:37.000
We have actually done that before; when I joined GitLab, we used pg_upgrade.

06:37.000 --> 06:44.000
But we had the business requirement, before going to a new PostgreSQL version, to make sure that it's operational,

06:44.000 --> 06:51.000
that all our data is correct, and that we don't have a performance degradation.

06:51.000 --> 06:58.000
And so we needed to run a significant number of tests, and the overall downtime for an upgrade was like four to six hours.

06:58.000 --> 07:07.000
So that was not really feasible for us. And basically, just to give you a little insight into our usage profile,

07:07.000 --> 07:13.000
we have over 50 million users around the world, basically in all time zones, and even our team,

07:13.000 --> 07:18.000
we have like over two and a half thousand people who all use the platform for their daily work.

07:18.000 --> 07:25.000
So the requirement we got was basically: we can't afford any downtime, so we had no budget for downtime,

07:25.000 --> 07:33.000
and also the requirement is that after we upgrade any component, if we then realize that we get a strong performance degradation,

07:33.000 --> 07:35.000
we have to be able to roll back.

07:35.000 --> 07:41.000
In many situations, if you do an upgrade and you have a performance degradation, you have other means of optimizing;

07:41.000 --> 07:48.000
for example, you can buy the next larger cloud instance, if you're running on a hyperscaler, or you can buy the next larger piece of hardware,

07:48.000 --> 07:53.000
But because we are basically running on the largest machines already, that's not an option.

07:53.000 --> 08:02.000
So if we would experience a performance degradation, we need to be able to roll back quite swiftly.

08:02.000 --> 08:07.000
So yeah, I mentioned zero downtime, but what is zero downtime?

08:07.000 --> 08:13.000
Because we are talking about a software-as-a-service application, there's no instant reaction.

08:13.000 --> 08:18.000
So if you press a button, like creating an issue, the issue is not there instantly.

08:18.000 --> 08:24.000
It will take like 100 milliseconds, 200 milliseconds depending on where you are.

08:24.000 --> 08:33.000
So we cannot aim for zero, that's not possible, so we need a metric to decide what counts as zero downtime.

08:33.000 --> 08:38.000
And what we do is we define that we don't want any noticeable user impact.

08:38.000 --> 08:44.000
And the good thing here is, we already had a metric for user impact in place,

08:44.000 --> 08:49.000
and that is called Apdex, the application performance index.

08:49.000 --> 08:53.000
We take samples for different user interactions.

08:53.000 --> 09:00.000
So for example, for creating an issue, or running a CI job, and we define what is satisfying for the user.

09:00.000 --> 09:07.000
Like, you click the button, new issue, and after a few hundred milliseconds, the issue pops up.

09:07.000 --> 09:11.000
It feels snappy; that feels satisfying.

09:12.000 --> 09:18.000
If you're clicking on create new issue, and you have to wait maybe one or two seconds, that does not feel too great.

09:18.000 --> 09:25.000
You're looking at a blank screen or a progress indicator; that does not feel great, but you tolerate it.

09:25.000 --> 09:33.000
If the action takes significantly longer, the user might get annoyed and press F5, try again, and that's a frustrating experience.

09:33.000 --> 09:40.000
So for a lot of actions within our application, we define these thresholds, and then we continuously take samples

09:40.000 --> 09:43.000
and calculate the Apdex.

09:43.000 --> 09:48.000
So for example, here you see my laser pointer is not really visible.

09:48.000 --> 09:55.000
But you see on the top line there's the satisfied count. So say one hundred percent of all requests were satisfying.

09:55.000 --> 10:02.000
If that is the case, then we have like one hundred satisfied samples; one hundred samples divided by one hundred is one.

10:02.000 --> 10:05.000
So if everything is perfect, we get one.

10:05.000 --> 10:14.000
And if all samples were frustrating, they get multiplied by zero, so we get a zero; divided by whatever, we still get a zero.

10:14.000 --> 10:21.000
So it's scaled from zero to one: zero is nobody satisfied, everyone frustrated, and one would be everyone satisfied.
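
Written out, that is the standard Apdex formula; the tolerating samples are weighted one half, which the talk skips over:

```latex
\mathrm{Apdex}_t = \frac{\text{Satisfied} \cdot 1 \,+\, \text{Tolerating} \cdot \frac{1}{2} \,+\, \text{Frustrated} \cdot 0}{\text{Total samples}}
```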

10:21.000 --> 10:26.000
That's the metric we already have in place, and we page on this metric.

10:26.000 --> 10:32.000
So if it goes down below 98 point something percent, people get paged.

10:32.000 --> 10:34.000
Yeah, how do we achieve that?

10:34.000 --> 10:39.000
There's a really cool method in PostgreSQL called logical replication.

10:39.000 --> 10:45.000
So basically, the thing we saw at the beginning, where you transform the physically stored data

10:45.000 --> 10:50.000
into a logical representation, can be used in a replicating fashion.

10:50.000 --> 10:55.000
And we use quite some automation to make it actually feasible.

10:55.000 --> 10:58.000
So what is logical replication?

10:58.000 --> 11:02.000
Yeah, or what does it give us?

11:02.000 --> 11:06.000
Unlike streaming replication, logical replication, because it uses the logical format,

11:06.000 --> 11:14.000
can replicate between different major PostgreSQL versions, and even different infrastructures, different architectures.

11:14.000 --> 11:25.000
So we can just clone our current production, upgrade it, bring it in sync again with the main production system, and then switch over later.

11:26.000 --> 11:28.000
One thing: doesn't it come with restrictions?

11:28.000 --> 11:31.000
Yeah, it comes with quite some restrictions.

11:31.000 --> 11:36.000
In a previous talk, where I talked about the method we used before,

11:36.000 --> 11:39.000
I went into more detail on the restrictions.

11:39.000 --> 11:41.000
You can watch the recording if you like.

11:41.000 --> 11:48.000
But for the scope of this talk, I will only go into the main restriction, which is: during logical replication,

11:48.000 --> 11:52.000
while it's enabled, we can't use any DDL.

11:53.000 --> 11:57.000
DDL is data definition language, so CREATE TABLE, ALTER TABLE, DROP TABLE.

11:57.000 --> 12:00.000
That's not possible anymore, unfortunately.

12:00.000 --> 12:05.000
For GitLab, we solved that with two features.

12:05.000 --> 12:14.000
One is a process feature: we block our delivery colleagues from deploying new GitLab versions that would alter the database.

12:14.000 --> 12:16.000
And they also get a heads-up one week ahead.

12:16.000 --> 12:20.000
We're announcing: hey, next week you're not allowed to change the database.

12:20.000 --> 12:26.000
And also we have a feature flag that you can use to block DDL from happening.

12:26.000 --> 12:28.000
That includes the deployments.

12:28.000 --> 12:31.000
But we also have some background workers, for example, that do partitioning.

12:31.000 --> 12:35.000
We have really large multi-terabyte tables that get partitioned in the background.

12:35.000 --> 12:38.000
All these jobs are frozen during the time period.

12:38.000 --> 12:43.000
If you want to do a zero-downtime upgrade for GitLab, you can use the same feature flag.

12:43.000 --> 12:47.000
If you use any other software, you might not have the problem in the first place,

12:47.000 --> 12:51.000
because most standard software does not do a lot of DDL in the background.

12:51.000 --> 12:57.000
But you definitely have to check for your concrete application beforehand.
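
GitLab's safeguard is the process plus the feature flag described above; as a generic illustration only, PostgreSQL event triggers can enforce such a DDL freeze:

```sql
-- A generic sketch, not GitLab's feature flag: reject DDL while the
-- upgrade window is open.
CREATE OR REPLACE FUNCTION block_ddl() RETURNS event_trigger AS $$
BEGIN
    RAISE EXCEPTION 'DDL is blocked during the zero-downtime upgrade window';
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER no_ddl_during_upgrade
    ON ddl_command_start
    EXECUTE FUNCTION block_ddl();

-- Afterwards: DROP EVENT TRIGGER no_ddl_during_upgrade;
```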

12:57.000 --> 13:03.000
So I'll walk you through a simplified version of the process.

13:03.000 --> 13:09.000
First we start out with our current version of the PostgreSQL database, in this case PostgreSQL 16,

13:09.000 --> 13:12.000
and our application talking to it.

13:12.000 --> 13:15.000
Then we create a one-to-one copy.

13:15.000 --> 13:20.000
In our case, we just create new virtual machines from snapshots of the currently running ones.

13:20.000 --> 13:31.000
And we use the standard method of streaming replication to get the second instance synchronized.

13:31.000 --> 13:37.000
There's a good thing: PostgreSQL, by design, writes all of its data changes into something called the write-ahead log,

13:37.000 --> 13:43.000
in binary format, and you can just stream that to a different server to keep it in the same state as the source.
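
For monitoring this step, PostgreSQL exposes the streaming state in a system view; a minimal check could be:

```sql
-- How far behind is each standby? (built-in view)
SELECT application_name, state, sent_lsn, replay_lsn, replay_lag
FROM pg_stat_replication;
```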

13:43.000 --> 13:52.000
So our starting system is called source, and the system we are replicating to is called target here.

13:52.000 --> 13:57.000
Then we can stop the replication and upgrade our target system.

13:57.000 --> 14:00.000
So we would run the program you saw before, pg_upgrade.

14:00.000 --> 14:03.000
We upgrade the data on disk.

14:03.000 --> 14:05.000
We can recreate the indexes.

14:05.000 --> 14:10.000
And afterwards we use logical replication to sync it again.

14:10.000 --> 14:18.000
So now we have the old database version running and a new cluster with the new version,

14:18.000 --> 14:22.000
and they have the same data again.
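
The building blocks of this sync step are plain SQL; a sketch with placeholder names (a real setup also has to position the replication slot correctly, which is omitted here):

```sql
-- On the source cluster (PostgreSQL 16):
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the upgraded target cluster (PostgreSQL 17):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=source-primary dbname=mydb'  -- placeholder connection
    PUBLICATION upgrade_pub
    WITH (copy_data = false);  -- the clone already contains the data
```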

14:22.000 --> 14:30.000
And yeah, once we are satisfied, we can just switch the application from connecting to the old one over to the new one.

14:33.000 --> 14:41.000
And that's basically the state we had at the end of 2023.

14:41.000 --> 14:45.000
And now to give a small look into the actual user impact.

14:45.000 --> 14:47.000
This is our Apdex.

14:47.000 --> 14:52.000
And yeah, to give you a good statistical view, that's basically the nitpicking view.

14:52.000 --> 14:54.000
I'm looking at the top one percent.

14:54.000 --> 15:00.000
So we are seeing from 0.99 to 1.0 Apdex.

15:00.000 --> 15:04.000
So here is our paging threshold as well:

15:04.000 --> 15:06.000
98.8 percent.

15:06.000 --> 15:10.000
So where people get paged, this line is not even visible in this graph.

15:10.000 --> 15:12.000
It's a little bit below.

15:12.000 --> 15:15.000
And yeah, you see that's one week of data.

15:15.000 --> 15:19.000
And you see that it's not constantly at the same level.

15:19.000 --> 15:21.000
We sometimes have degradation.

15:21.000 --> 15:23.000
It can be the application or the database.

15:21.000 --> 15:23.000
Or maybe a server is failing.

15:25.000 --> 15:28.000
We need to start a new instance or we have more load.

15:28.000 --> 15:31.000
We start more pods, things like that.

15:31.000 --> 15:34.000
And this window here was one of our switchovers.

15:34.000 --> 15:38.000
So you can see the impact on our users is barely measurable.

15:38.000 --> 15:42.000
So during the switchover, we have a short degradation window.

15:42.000 --> 15:44.000
But it's less than the normal noise.

15:44.000 --> 15:47.000
So it's measurable, but it's not significant.

15:47.000 --> 15:55.000
So yeah, we were able to do PostgreSQL upgrades with basically zero downtime and zero user impact.

15:55.000 --> 15:57.000
But that's old news.

15:57.000 --> 16:01.000
So today, the focus is: what did we improve since then?

16:01.000 --> 16:03.000
Or what can we improve further?

16:03.000 --> 16:05.000
And we have two main things here.

16:05.000 --> 16:07.000
One is more of a business requirement.

16:07.000 --> 16:09.000
And that was the switchover:

16:09.000 --> 16:12.000
when we are switching over from the old version to the new version,

16:12.000 --> 16:14.000
This was a point of no return.

16:14.000 --> 16:16.000
We couldn't switch back.

16:16.000 --> 16:19.000
And in the past, we had problems where, after the switchover,

16:19.000 --> 16:21.000
we had a performance degradation.

16:21.000 --> 16:24.000
And the requirement here is to remove this uncertainty.

16:24.000 --> 16:25.000
This risk.

16:25.000 --> 16:30.000
So we needed to change the process in order to be able to roll back even after this switchover.

16:30.000 --> 16:33.000
And the second thing is something that I really want.

16:33.000 --> 16:38.000
Because this whole process only upgrades PostgreSQL.

16:38.000 --> 16:41.000
Not any libraries, not the operating system.

16:41.000 --> 16:46.000
So I would like to combine both to reduce the labor involved.

16:46.000 --> 16:53.000
Okay, how do we remove, or move, the point of no return to a later stage?

16:53.000 --> 16:56.000
And that's relatively easy.

16:56.000 --> 16:59.000
All the technology we need is already in place.

16:59.000 --> 17:06.000
Because after the switchover, we can reverse the replication and stream all the data back to the

17:06.000 --> 17:07.000
old cluster.

17:07.000 --> 17:09.000
So the old cluster is kept in sync.

17:09.000 --> 17:14.000
And if we come to the conclusion that we can't fix the new cluster, we can roll back.

17:14.000 --> 17:19.000
In reality, we would put a lot of people into optimizing the queries and making the new cluster perform.

17:19.000 --> 17:22.000
But if that's no longer possible, we have an option left.

17:22.000 --> 17:31.000
So after the upgrade, we would leave the old cluster alive and replicate

17:31.000 --> 17:36.000
via logical replication into the old cluster.

17:36.000 --> 17:38.000
And we can operate and monitor.

17:38.000 --> 17:43.000
And if we come to the conclusion that we really have to roll back, we can switch back to the old one

17:43.000 --> 17:48.000
without losing any data.
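
Reversing the direction reuses exactly the same mechanics as before, just pointed the other way; a sketch with placeholder names:

```sql
-- On the NEW cluster, which now takes the writes:
CREATE PUBLICATION rollback_pub FOR ALL TABLES;

-- On the OLD cluster, which now follows along as a fallback:
CREATE SUBSCRIPTION rollback_sub
    CONNECTION 'host=new-primary dbname=mydb'  -- placeholder connection
    PUBLICATION rollback_pub
    WITH (copy_data = false);  -- both sides already hold the data
```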

17:48.000 --> 17:53.000
So now to the OS upgrade.

17:53.000 --> 17:54.000
Why are OS upgrades a problem?

17:54.000 --> 18:01.000
Why can't I just create a new server with a new Linux version of my favorite distribution

18:01.000 --> 18:04.000
and just switch over to this version?

18:04.000 --> 18:08.000
There can be multiple problems with libraries, but the major thing here is something called

18:08.000 --> 18:09.000
Collation.

18:09.000 --> 18:13.000
New operating system versions normally come with a new C library; for most,

18:13.000 --> 18:18.000
it's glibc, and glibc provides something called the system-wide collation.

18:18.000 --> 18:21.000
And that's the sort order, how you order strings.

18:21.000 --> 18:23.000
At first glance, it should be super easy.

18:23.000 --> 18:30.000
If you want to order single characters, like A, B, C, it's obvious: first the A, then the B, then the C.

18:30.000 --> 18:35.000
It becomes much more problematic if you have lower case and upper case.

18:35.000 --> 18:40.000
If you want to order a list of lower and upper case letters, do you want to have first the lower case a, then the upper case A,

18:40.000 --> 18:45.000
or do you want to order a, b, c, d until the end and then start with the upper case letters?

18:45.000 --> 18:50.000
But it's more complicated if you have, for example, strings of numbers, what should come first?

18:50.000 --> 18:54.000
0, 1, or 2, or with special characters.

18:54.000 --> 18:58.000
And unfortunately, there's not one perfect collation that never changes.

18:58.000 --> 19:00.000
Unfortunately, the collation changes regularly.

19:00.000 --> 19:07.000
So when you get a new operating system, you have to accept that your strings will be sorted differently

19:07.000 --> 19:08.000
than before.
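
You can see the effect directly in SQL; a small illustration comparing the byte-order "C" collation with a linguistic one (the exact linguistic order depends on your glibc version):

```sql
-- The same values, two different orderings:
SELECT val FROM (VALUES ('a'), ('B'), ('Ä')) AS t(val)
ORDER BY val COLLATE "C";        -- byte order: 'B', 'a', 'Ä'
-- versus, for example:
-- ORDER BY val COLLATE "en_US"; -- linguistic order, glibc-dependent
```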

19:08.000 --> 19:11.000
And if you have a stateless application, that's not a big deal.

19:11.000 --> 19:19.000
But if you have indexes that were created with one collation, and all your data was ordered after one ordering pattern,

19:19.000 --> 19:23.000
and then you switch to a new operating system and it starts to order things differently,

19:23.000 --> 19:27.000
then you have the problem that you can't find values that are already in your index.

19:27.000 --> 19:32.000
If you use the index for search, that's already annoying because you can't find the stuff you already have.

19:32.000 --> 19:42.000
But if you use the index to, for example, enforce a constraint, like a unique constraint, you are now able to put duplicates in your database, even though they should be unique.

19:42.000 --> 19:47.000
And that will break your data and corrupt your data.

19:47.000 --> 19:52.000
Yeah, how do we solve this problem?

19:52.000 --> 19:55.000
The first thing is, this does not apply to all indexes.

19:55.000 --> 20:00.000
For example, there are some simple data types where you don't have a special collation like an integer.

20:00.000 --> 20:05.000
It's quite easy to sort an integer; you can do it in a decimal representation or in binary.

20:05.000 --> 20:07.000
There's not much to it.

20:07.000 --> 20:16.000
But basically, the simple solution is to rebuild all of your indexes on complex or collation-based data types, mostly strings.

20:16.000 --> 20:23.000
So the good thing is, if you do the upgrade which I suggested in the beginning, the dump/restore upgrade basically, you don't have the problem.

20:23.000 --> 20:26.000
All of the indexes will be rebuilt anyways.

20:27.000 --> 20:33.000
But if you use any of the other upgrade methods, you have to do it yourself manually.

20:33.000 --> 20:46.000
And if you would just rebuild all indexes for GitLab, it would also take multiple days, so that's not feasible for us.

20:46.000 --> 20:49.000
So what do we do to make it feasible?

20:50.000 --> 20:59.000
Long before the upgrade, we create a new system, we do a test upgrade with a production copy, and then we use PostgreSQL-internal functions from a module called amcheck.

20:59.000 --> 21:04.000
And we check which indexes would be corrupted if we upgrade.

21:04.000 --> 21:10.000
Then we make a list out of all these indexes and recreate them.
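
A minimal sketch of such a check with amcheck (the index name is a placeholder):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Check one B-tree index; raises an error if it is inconsistent.
SELECT bt_index_check('my_index'::regclass, true);  -- true: also verify heap

-- List all B-tree indexes, e.g. to iterate over them in a script:
SELECT c.relname
FROM pg_class c JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i' AND am.amname = 'btree';
```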

21:10.000 --> 21:18.000
If that's a fairly easy operation and we can just recreate them all, we just take the list and start with our upgrade process.

21:18.000 --> 21:22.000
If the recreation of all of those would take too long, we have to optimize.

21:22.000 --> 21:34.000
You can, for example, check whether some of these indexes might not be needed at all, or you can try to use a different index type, or maybe you can break down an index: instead of one large index for the whole table,

21:34.000 --> 21:43.000
you can have a lot of partial indexes that only index certain areas, certain spans of a table, things like that.
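
For example, take a hypothetical issues table where most queries only touch open issues; instead of one huge index you could rebuild just the hot part:

```sql
-- Hypothetical table and columns, purely illustrative:
CREATE INDEX CONCURRENTLY issues_open_created_idx
    ON issues (created_at)
    WHERE state = 'open';  -- partial: only indexes the rows queries filter on
```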

21:43.000 --> 21:54.000
And you can also do something like a lazy recreate. For example, you come to the conclusion: we have a really large index, and it's used to optimize full-text search on all of your issues.

21:54.000 --> 22:00.000
And the negative effect, if this index were slightly corrupted, would be that,

22:00.000 --> 22:10.000
in a totally weird event, somebody couldn't find an issue that has German umlauts in it. And yeah, you can decide: okay, that's not super critical.

22:10.000 --> 22:19.000
We are fine with not recreating this index during the upgrade window. We are fine if it takes until the next Monday morning, because we will have no data corruption,

22:19.000 --> 22:30.000
we will only have a slight functional degradation here. So if you're looking into that beforehand, you can just make sensible decisions.

22:30.000 --> 22:36.000
So to give you a perspective on how we normally do it: we do all these upgrades on the weekend, because that's when we have the lowest load.

22:36.000 --> 22:41.000
For us, that's still not no load, unfortunately, but the lowest.

22:41.000 --> 22:48.000
Yeah, and on Saturday morning, we would start all the steps required for an upgrade.

22:48.000 --> 22:55.000
And then we still have until Sunday to do additional maintenance operations like index recreations.

22:55.000 --> 23:02.000
So depending on how long the first steps take, we have at least 12 to 24 hours for that.

23:02.000 --> 23:10.000
So we recreate all indexes, and afterwards we run this internal function amcheck again to make sure that we really have no corruption.

23:10.000 --> 23:18.000
And most of the time, we also have time for running additional sanity checks.

23:18.000 --> 23:25.000
Cool. And now I would like to walk through the full process as we did it last year, with all the improvements.

23:26.000 --> 23:35.000
We are going from PostgreSQL 16 to 17, and from Ubuntu 20.04 to 22.04.

23:35.000 --> 23:40.000
Okay, we start out with our database again and the application talking to it.

23:40.000 --> 23:45.000
And because it's oversimplified, I'll give you a little glimpse of what's behind the symbols.

23:45.000 --> 23:54.000
So the GitLab icon on the left, that's our application stack; most of it runs on Kubernetes, except our Redis.

23:54.000 --> 24:02.000
And the database for this case, it's nine large instances, and one smaller one for taking snapshots.

24:02.000 --> 24:06.000
And we have distributed them across three availability zones.

24:06.000 --> 24:20.000
And we start out with Ubuntu 20.04 and PostgreSQL 16. On the top right, I have a traffic light icon to show you whether, in the current phase, data definition language is allowed.

24:20.000 --> 24:27.000
And for us, it's quite important, because multiple times a week my colleagues deploy new versions, and sometimes they want to change the schema.

24:27.000 --> 24:33.000
So one of my requirements was to keep the phase where they can't execute DDL to the minimum.

24:33.000 --> 24:38.000
So the first step is: we create a test cluster to get all our metrics, to know what we are dealing with.

24:38.000 --> 24:43.000
The test cluster can be minimal; in our case, that's still like three nodes.

24:43.000 --> 24:50.000
Also, yeah, that's not affecting production at all. It already starts out with the new OS version, the new PostgreSQL version.

24:50.000 --> 24:56.000
And then we do a mock upgrade. So we upgrade our test cluster.

24:56.000 --> 25:02.000
We get all the correct execution times; we need to know exactly how long each step takes.

25:02.000 --> 25:09.000
And we also get a list of all the corrupted indexes, because we run the amcheck tooling.

25:09.000 --> 25:16.000
Once we have all the metrics we need, we can schedule our upgrade; we remove the test cluster and we create the actual target cluster.

25:16.000 --> 25:25.000
In our case, again, it's nine large nodes, over three availability zones and one backup node.

25:25.000 --> 25:28.000
And now comes the step where we have to disable DDL.

25:28.000 --> 25:37.000
Because now we have to switch from streaming replication, where we just send the actual byte data from the source to the target cluster,

25:37.000 --> 25:44.000
to logical replication: now the source cluster has to translate this binary data into the logical representation and send it to the target.

25:44.000 --> 25:52.000
And DDL would break the process. So now begins the phase where my colleagues are no longer allowed to deploy new schema changes.

25:52.000 --> 25:59.000
That's, by the way, Saturday morning in this case.

25:59.000 --> 26:09.000
Now we stop the replication and upgrade our target cluster, so we run pg_upgrade; it runs like 20-something minutes, I guess.

26:09.000 --> 26:14.000
Then we re-synchronize again, we have the same state, we get fresh data again.

26:14.000 --> 26:20.000
And then we can do all the additional steps, so we can do a full re-index to make sure we don't have the corrupted indexes.

26:20.000 --> 26:26.000
Afterwards, we can run ANALYZE. ANALYZE goes through all your data and creates statistics, which the planner needs.

26:26.000 --> 26:28.000
It's super important to have fresh statistics; that's how the planner knows:

26:28.000 --> 26:31.000
can I use an index, or do I have to read the full table?

26:31.000 --> 26:37.000
Or if you have partitioned tables, it knows: oh, for this query, I have to look into these partitions.
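
A sketch of that statistics step; staging the statistics target gets the planner usable numbers quickly, then full quality (vacuumdb --analyze-in-stages does something similar from the shell):

```sql
SET default_statistics_target = 10;  -- coarse statistics, fast first pass
ANALYZE;
RESET default_statistics_target;
ANALYZE;                             -- full-quality statistics
```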

26:37.000 --> 26:48.000
And also, yeah, we run the corruption check to make sure that our index mitigations were successful.

26:48.000 --> 26:59.000
And now we can start with the switchover. The first thing: in this step, the application still talks to the source cluster, to the PostgreSQL 16 one.

26:59.000 --> 27:05.000
And then we gradually start to load balance read-only queries to the new cluster.

27:05.000 --> 27:13.000
So we start with one replica: all of the read queries get scattered across all of the standbys in the old cluster, the source cluster,

27:13.000 --> 27:18.000
and one standby of the target cluster gets into the load balancing.

27:18.000 --> 27:22.000
And that's really cool because we have a nice dashboard and we can see the performance.

27:22.000 --> 27:24.000
We can make an apples-to-apples comparison.

27:24.000 --> 27:30.000
We make sure all of the standbys get roughly the same number of connections,

27:30.000 --> 27:33.000
and we can compare the performance apples to apples.

27:33.000 --> 27:38.000
We can see: does the new standby have more CPU load, does it do more IOPS, things like that.

27:38.000 --> 27:45.000
In general, if we are going to a new PostgreSQL version, performance increases, but sometimes not, because a query plan flips.

27:45.000 --> 27:51.000
And then we have hours and hours of time, without any impact, in which we can optimize the queries.

27:51.000 --> 27:58.000
So during the process, we have the backend developers on call, and we call them in if we find a performance problem.

27:58.000 --> 28:02.000
And we can optimize the queries long before we do the full switchover.

28:02.000 --> 28:08.000
Yeah, and if we are satisfied, we move all of the read load to the target cluster.

28:08.000 --> 28:14.000
So all of the writes still happen on the source cluster, and then they propagate via logical replication to the target cluster.

28:14.000 --> 28:17.000
And all the read load goes to the target cluster already.

28:17.000 --> 28:20.000
And yeah, we can monitor the performance.

28:20.000 --> 28:23.000
If it looks great, we run our full end-to-end tests.

28:24.000 --> 28:28.000
And they are quite significant.

28:28.000 --> 28:31.000
They basically use all of the features that GitLab has.

28:31.000 --> 28:34.000
They create issues, and so on.

28:34.000 --> 28:38.000
Yeah, they basically call a lot of the functions, also writing ones,

28:38.000 --> 28:42.000
and they measure not only that it works, but also the performance.

28:42.000 --> 28:48.000
And that's really tricky, because in this construct here, all the data is written to the source cluster and then replicated to the target cluster.

28:48.000 --> 28:52.000
This increases the latency.

28:52.000 --> 28:58.000
So we even needed to make our QA tests a bit more resilient so that they still work.

28:58.000 --> 29:04.000
But now that's a fairly great method to test the complete functionality,

29:04.000 --> 29:12.000
and also to get insight into the performance, because again we can compare.

29:12.000 --> 29:17.000
And if we are satisfied with the performance, and all tests are successful,

29:17.000 --> 29:19.000
we can do the full switchover.

29:19.000 --> 29:24.000
Now we break the replication from source to target,

29:24.000 --> 29:31.000
and we load balance all of the write queries to the target cluster as well.

29:31.000 --> 29:35.000
And this part we basically already had two years ago.

29:35.000 --> 29:38.000
But now comes the important improvement.

29:38.000 --> 29:45.000
Now we don't have to pray, because we still have a way to roll back should performance degrade.

29:45.000 --> 29:50.000
And we keep that running for the whole of Monday, because during Sunday, as mentioned, we don't have too much load.

29:50.000 --> 29:55.000
So if there would be edge cases that only happen during peak hours, we might not find them on Sunday.

29:55.000 --> 30:00.000
So on Monday, when the load starts, we first see a peak during European work hours.

30:00.000 --> 30:04.000
So when Europeans start work, we see the load rising.

30:04.000 --> 30:09.000
And then we see when the US east coast starts to work. And then we see when the US west coast starts working.

30:09.000 --> 30:14.000
And this is the critical phase: when the US west coast starts to work,

30:14.000 --> 30:17.000
people on the east coast still work, and people in Europe still work.

30:17.000 --> 30:20.000
Those are normally our peak hours of the day.

30:20.000 --> 30:27.000
And there we really have to monitor that everything is optimal, or we have to optimize on the fly.

30:27.000 --> 30:30.000
Yeah, and we run that for the whole of Monday.

30:30.000 --> 30:35.000
Logical replication is still needed for that, so we can keep the old cluster in sync.

30:35.000 --> 30:45.000
And on Tuesday morning, Europe time, if everything worked as expected, we stop the replication and remove the old cluster.

30:45.000 --> 30:48.000
And now we are fully on 17.

30:48.000 --> 30:53.000
And because we stopped the logical replication, DDL is possible again.

31:00.000 --> 31:03.000
Yeah, I rushed quite a bit through that.

31:03.000 --> 31:07.000
So I really hope you have some questions for me.

31:07.000 --> 31:15.000
If you want to deep dive into that, we have here an overview of our database infrastructure.

31:15.000 --> 31:21.000
Underneath, I didn't mention that before, all of the actions you have seen are not executed manually.

31:21.000 --> 31:25.000
We have everything in Ansible playbooks. We do that multiple times a year.

31:25.000 --> 31:27.000
It would be super tedious to do it manually.

31:27.000 --> 31:29.000
And it would introduce new human error every time.

31:29.000 --> 31:34.000
So every single step ends up in playbooks, in the repo you can see there.

31:34.000 --> 31:41.000
And also, for us, a fairly large company from my point of view, I have to coordinate this with a lot of people.

31:41.000 --> 31:47.000
I have to coordinate with delivery.

31:47.000 --> 31:52.000
Some of our larger customers would like to be informed of such things.

31:52.000 --> 31:58.000
And so I have to coordinate with a lot of different stakeholders in the company.

31:58.000 --> 32:02.000
And I also have to set up schedules for on-call.

32:02.000 --> 32:06.000
And we at GitLab, we use GitLab to organize ourselves.

32:06.000 --> 32:10.000
So we use GitLab issues and epics to organize such upgrades.

32:10.000 --> 32:15.000
And we have a large template, an issue template.

32:15.000 --> 32:18.000
So when we do a new upgrade, we create a new issue based on this template.

32:18.000 --> 32:25.000
And it has all the checklists: whom to inform, which merge requests to create in order to change the rules.

32:25.000 --> 32:28.000
And basically all the steps are in there.

32:28.000 --> 32:32.000
This thing is fairly large, and it will most likely not work for your organization as-is.

32:32.000 --> 32:38.000
But if you have to organize such an upgrade, it would be really interesting for you to go through it.

32:38.000 --> 32:42.000
Also, I oversimplified a few steps to fit it in the time.

32:42.000 --> 32:48.000
If you are more interested in the actual, for example, in the actual caveats of logical replication,

32:48.000 --> 32:54.000
there are a few things about why sequences are a problem; I have a recording for that there.

32:55.000 --> 32:58.000
Yeah, the slides are already updated and uploaded.

32:58.000 --> 32:59.000
And I have two versions.

32:59.000 --> 33:00.000
One is the one you are seeing right now.

33:00.000 --> 33:03.000
And one is an extended version with roughly double the number of slides,

33:03.000 --> 33:06.000
with additional explanations for you.

33:06.000 --> 33:07.000
Yeah.

33:07.000 --> 33:09.000
And now I really hope you have questions.

33:09.000 --> 33:11.000
You can approach me during the event.

33:11.000 --> 33:12.000
I'm sometimes at the PostgreSQL booth,

33:12.000 --> 33:15.000
Sometimes at the GitLab booth or running around.

33:15.000 --> 33:19.000
You can also write me something or you can ask right now.

33:19.000 --> 33:22.000
Because apparently we have some time for that.

33:25.000 --> 33:26.000
Yes.

33:26.000 --> 33:27.000
Thank you.

33:31.000 --> 33:32.000
Who is first?

33:32.000 --> 33:33.000
Somewhere right there.

33:38.000 --> 33:39.000
Okay.

33:39.000 --> 33:40.000
I have the following question.

33:40.000 --> 33:43.000
Like you said you are going to 17.

33:43.000 --> 33:51.000
Basically, you are staying one version behind the tip of Postgres, as far as I understand.

33:51.000 --> 33:54.000
But during those periods when you do the upgrade.

33:54.000 --> 33:59.000
Do you also try to see whether you can upgrade to the,

33:59.000 --> 34:03.000
like to the newest version.

34:03.000 --> 34:08.000
Not in order to upgrade, just to see whether you will have any problems with that.

34:08.000 --> 34:09.000
Okay.

34:09.000 --> 34:10.000
I'll summarize the question.

34:10.000 --> 34:13.000
You mentioned that we are going to PostgreSQL 17,

34:13.000 --> 34:15.000
which is one version behind the current stable,

34:15.000 --> 34:19.000
and you asked me if we have considered going to the latest version.

34:20.000 --> 34:22.000
Okay. Awesome.

34:22.000 --> 34:23.000
Yeah.

34:23.000 --> 34:31.000
In general, we have this policy

34:31.000 --> 34:36.000
that we want to go to a new PostgreSQL version before the next version comes out.

34:36.000 --> 34:38.000
That's the ideal thing.

34:38.000 --> 34:45.000
But we would not want to go to a version before it's older than like half a year.

34:46.000 --> 34:48.000
I really like every new feature.

34:48.000 --> 34:53.000
But I don't have the capacity to find all the cool new bugs in production.

34:53.000 --> 34:55.000
So I'm really happy if somebody else finds them first.

34:55.000 --> 34:59.000
So we have a certain mandatory delay before we go to the new version.

34:59.000 --> 35:02.000
For PostgreSQL 18, it's a bit different,

35:02.000 --> 35:06.000
because in PostgreSQL 18 there are optimizations that we would really love to have.

35:06.000 --> 35:09.000
We have a problem called lightweight lock contention.

35:09.000 --> 35:12.000
So PostgreSQL has explicit locks, which are used in queries.

35:12.000 --> 35:14.000
You can say LOCK TABLE or something.

35:14.000 --> 35:18.000
But there's also an internal construct called lightweight locks.

35:18.000 --> 35:21.000
And that's biting us in our peak hours sometimes.

35:21.000 --> 35:24.000
And PostgreSQL 18 comes with some optimizations.

35:24.000 --> 35:26.000
So I would really like to go to 18.

35:26.000 --> 35:32.000
So we might make an out-of-band upgrade and basically right now start planning for that.

35:32.000 --> 35:33.000
Awesome.

35:33.000 --> 35:35.000
Thank you for the question.

35:36.000 --> 35:38.000
Hi.

35:38.000 --> 35:44.000
Did I understand correctly that the period that you cannot perform DDL is three days?

35:44.000 --> 35:49.000
And if so, is that not a problem for your change processes?

35:49.000 --> 35:52.000
Sorry, I couldn't acoustically catch that.

35:52.000 --> 35:53.000
Okay.

35:53.000 --> 36:04.000
Did I understand correctly that the period in which you cannot perform DDL is three days?

36:04.000 --> 36:08.000
And if so, is that not a problem for your change processes?

36:08.000 --> 36:09.000
Yes.

36:09.000 --> 36:10.000
That's correct.

36:10.000 --> 36:17.000
The time frame in which we can't perform DDL is from Saturday to Tuesday morning.

36:17.000 --> 36:18.000
That's the phase.

36:18.000 --> 36:19.000
That's something we chose.

36:19.000 --> 36:20.000
We could make it shorter.

36:20.000 --> 36:22.000
But then we don't have a rollback option anymore.

36:22.000 --> 36:23.000
Quick rollback.

36:23.000 --> 36:28.000
Before, we only had it until Sunday, Sunday evening,

36:28.000 --> 36:30.000
but then we couldn't roll back on Monday.

36:30.000 --> 36:33.000
So for us the trade-off was: we want to have a larger rollback window,

36:33.000 --> 36:36.000
and therefore we are fine with having the DDL block for longer.

36:36.000 --> 36:38.000
And it's not a large problem.

36:38.000 --> 36:40.000
There are basically two things that can't happen.

36:40.000 --> 36:43.000
We can't deploy new GitLab versions that would change the schema.

36:43.000 --> 36:46.000
And not every GitLab minor update changes the schema.

36:46.000 --> 36:48.000
So that's not a huge problem.

36:48.000 --> 36:52.000
And also we can't do background re-indexing.

36:52.000 --> 36:54.000
Which is also not a problem.

36:54.000 --> 36:55.000
We just pause it for the time.

36:55.000 --> 36:58.000
And, in the same category,

36:58.000 --> 37:02.000
we can't do background partitioning,

37:02.000 --> 37:03.000
because we have fairly large tables

37:03.000 --> 37:07.000
that we partition in the background, creating new sub-tables.

37:07.000 --> 37:09.000
And that's also not a problem.

37:09.000 --> 37:11.000
We pre-create them for like a month or so.

37:11.000 --> 37:15.000
So in this case, we could block DDL for months

37:15.000 --> 37:16.000
without that being any problem.

37:18.000 --> 37:19.000
Awesome.

37:19.000 --> 37:20.000
Thank you for the question.

37:20.000 --> 37:31.000
I have a question related to the rebuilding of the indexes.

37:31.000 --> 37:36.000
And the corruption you mentioned that is maybe possible after the glibc upgrade.

37:36.000 --> 37:38.000
Did it actually happen to you?

37:38.000 --> 37:39.000
Do you have that experience?

37:39.000 --> 37:41.000
Or is it purely hypothetical?

37:41.000 --> 37:47.000
And in your steps, you upgraded the database first,

37:47.000 --> 37:50.000
then enabled logical replication, and then rebuilt the indexes.

37:50.000 --> 37:53.000
So I wonder whether this data corruption could happen,

37:53.000 --> 37:55.000
or whether the order should be different,

37:55.000 --> 37:59.000
like first rebuilding indexes and then enabling logical replication.

37:59.000 --> 38:00.000
Okay. Awesome.

38:00.000 --> 38:03.000
The first question was regarding index corruption:

38:03.000 --> 38:05.000
did we have index corruption?

38:05.000 --> 38:07.000
And the answer is: during our tests,

38:07.000 --> 38:08.000
yes.

38:08.000 --> 38:11.000
For the current upgrade, going from Ubuntu 20.04 to 22.04,

38:11.000 --> 38:14.000
there were like 20 corrupted indexes.

38:14.000 --> 38:18.000
But it never materialized as a problem because we found out

38:18.000 --> 38:21.000
during our test upgrade and we never went to production with that.

38:21.000 --> 38:23.000
We made the list of indexes.

38:23.000 --> 38:27.000
We optimized and recreated all of them before going to production.

38:27.000 --> 38:31.000
And the second question was regarding the sequence of events.

38:33.000 --> 38:36.000
So here, when we start out, when we create the test cluster,

38:36.000 --> 38:39.000
that's streaming replication because it's the most efficient one.

38:39.000 --> 38:41.000
And then we break it.

38:41.000 --> 38:44.000
We stop then we...

38:44.000 --> 38:46.000
Oh, that's the test cluster.

38:46.000 --> 38:47.000
It's irrelevant.

38:47.000 --> 38:48.000
Okay, here.

38:48.000 --> 38:49.000
We start...

38:52.000 --> 38:53.000
Okay, here.

38:53.000 --> 38:54.000
That starts out.

38:54.000 --> 38:56.000
We create the test cluster.

38:56.000 --> 38:57.000
No.

38:57.000 --> 38:58.000
We create the target cluster.

38:58.000 --> 39:02.000
That starts out with streaming replication because it's just the most efficient.

39:02.000 --> 39:05.000
And then we switch it to logical replication.

39:05.000 --> 39:09.000
During here, I mean, already here,

39:09.000 --> 39:12.000
the indexes on the target cluster are corrupted.

39:12.000 --> 39:17.000
You couldn't send the live traffic to this cluster because you would get wrong answers.

39:17.000 --> 39:19.000
You couldn't switch over to that.

39:19.000 --> 39:21.000
But it does not matter, because we have it in production

39:21.000 --> 39:22.000
but we don't use it.

39:22.000 --> 39:24.000
The application does not talk to it.

39:24.000 --> 39:25.000
So at the current state it's corrupted,

39:25.000 --> 39:28.000
but the application does not talk to it.

39:28.000 --> 39:29.000
Does that answer it?

39:29.000 --> 39:31.000
Is that an answer to your question?

39:31.000 --> 39:32.000
Awesome.

39:32.000 --> 39:33.000
Thank you.

39:36.000 --> 39:37.000
No.

39:37.000 --> 39:38.000
Can you hear me?

39:41.000 --> 39:44.000
So you mentioned briefly that you use the write-ahead log

39:44.000 --> 39:47.000
of PostgreSQL for managing logical replication.

39:47.000 --> 39:51.000
Do you use any additional tool to handle that replication?

39:51.000 --> 39:55.000
And also for managing the N plus 1 server nodes you use,

39:55.000 --> 39:57.000
perhaps when handling a failover?

39:59.000 --> 40:01.000
Just the write-ahead log by itself?

40:01.000 --> 40:04.000
The question is: do we use the write-ahead log for logical replication?

40:04.000 --> 40:05.000
Yeah.

40:05.000 --> 40:07.000
Do you use any additional tools?

40:07.000 --> 40:10.000
Or is there something like PgBouncer, for instance?

40:10.000 --> 40:11.000
Okay.

40:11.000 --> 40:15.000
First of all, that's a misconception: we're not using the write-ahead log for logical replication directly.

40:15.000 --> 40:21.000
The write-ahead log basically contains the data changes. I would like to answer that.

40:21.000 --> 40:23.000
But you had an additional question.

40:23.000 --> 40:24.000
Okay.

40:24.000 --> 40:25.000
Awesome.

40:25.000 --> 40:26.000
Okay.

40:26.000 --> 40:27.000
The write-ahead log is not used for logical replication directly.

40:27.000 --> 40:33.000
For the replication itself, we use the PostgreSQL built-in feature of logical replication.

40:33.000 --> 40:35.000
But we have a lot of different tooling.

40:35.000 --> 40:37.000
For example, we have connection pooling.

40:37.000 --> 40:39.000
We use PgBouncer for connection pooling,

40:39.000 --> 40:43.000
and we have a fleet of PgBouncers because it's CPU-bound.

40:43.000 --> 40:48.000
And for management of our high availability, we use Patroni.

40:48.000 --> 40:51.000
Patroni is a PostgreSQL high-availability toolkit,

40:51.000 --> 40:54.000
and it's also controlling the PgBouncers.

40:54.000 --> 40:59.000
There's a centralized single source of truth, in our case Consul.

40:59.000 --> 41:05.000
And if we do the switchover, we basically just say: hey, Patroni, switch over to this node.

41:05.000 --> 41:11.000
And Patroni tells PgBouncer: hey, pause connections here, and redirects them to the other ones.

41:11.000 --> 41:12.000
Awesome.

41:12.000 --> 41:19.000
If you want to take a look at our Ansible, you see where we integrate, where we hook into the systems.

41:19.000 --> 41:20.000
Awesome.

41:20.000 --> 41:21.000
Thank you.

41:21.000 --> 41:23.000
Do we have a mic here?

41:23.000 --> 41:24.000
Hi.

41:24.000 --> 41:26.000
Oh, you have new people.

41:26.000 --> 41:27.000
Sorry.

41:28.000 --> 41:30.000
Thank you for the talk.

41:30.000 --> 41:33.000
I was really surprised to hear about that collation issue.

41:33.000 --> 41:35.000
I hadn't heard of that before.

41:35.000 --> 41:41.000
I wondered if Postgres had ever considered dealing with collation itself somehow

41:41.000 --> 41:44.000
to make this whole upgrade process easier.

41:44.000 --> 41:45.000
Yeah.

41:45.000 --> 41:47.000
Has that ever been considered, or...?

41:47.000 --> 41:48.000
Yeah, definitely.

41:48.000 --> 41:53.000
First of all, I'm quite lucky that I have known about the collation issue for quite a long time.

41:53.000 --> 41:58.000
Because some years ago, I worked for a PostgreSQL consulting company, and the collation for the German umlauts

41:58.000 --> 42:04.000
changed at some point, so we had a lot of German customers who had this problem many years ago.

42:04.000 --> 42:09.000
And now it's biting more people because the collation for different characters was changed.

42:09.000 --> 42:12.000
And yeah, in newer PostgreSQL versions, there are different methods.

42:12.000 --> 42:15.000
Back in the day, you were bound to use the system-wide collation.

42:15.000 --> 42:17.000
But now you can use different collations.

42:17.000 --> 42:20.000
So there is, for example, ICU.

42:20.000 --> 42:22.000
There are other collation providers.

42:22.000 --> 42:24.000
You can say: hey, Postgres,

42:24.000 --> 42:26.000
ignore the system collation, use this one.

42:26.000 --> 42:31.000
But it also comes with caveats, to sum it up in one sentence.

42:31.000 --> 42:32.000
Okay.

42:32.000 --> 42:33.000
Yeah.

42:33.000 --> 42:34.000
Thank you.

42:36.000 --> 42:41.000
The default is that it uses the collation of your glibc, your system C library.

42:41.000 --> 42:43.000
And that's where the problem comes from.

42:43.000 --> 42:48.000
Because if you're doing an operating system upgrade and anything changes in glibc,

42:48.000 --> 42:51.000
you are about to find interesting behavior.

42:58.000 --> 43:05.000
Before the switchover, how did you distinguish the read and the write queries?

43:05.000 --> 43:08.000
Could you put it a bit closer to your mouth?

43:08.000 --> 43:13.000
Before the switchover, you had the read and write queries.

43:13.000 --> 43:17.000
How did you distinguish which are read and which are write queries?

43:17.000 --> 43:18.000
Awesome question.

43:18.000 --> 43:23.000
So, around the switchover, we first switch over the read queries and then the write queries.

43:23.000 --> 43:24.000
How did we do it?

43:24.000 --> 43:30.000
And that's, unfortunately for most of you, a built-in GitLab feature.

43:30.000 --> 43:35.000
So GitLab, the application, is built on our Ruby on Rails stack, and it knows that certain queries

43:35.000 --> 43:38.000
will read and write, and certain queries will only read.

43:38.000 --> 43:43.000
So it connects to two pools: one read-write pool, which only talks to the primary,

43:43.000 --> 43:48.000
and one read pool, where we have a lot of standbys and the Ruby application talks to all the standbys.

43:48.000 --> 43:53.000
And Ruby on Rails decides where to send each query.

43:53.000 --> 43:55.000
Okay.

43:55.000 --> 43:56.000
Hi.

43:56.000 --> 43:59.000
Here.

43:59.000 --> 44:04.000
So can you explain a bit more how the upgrade worked?

44:04.000 --> 44:08.000
Actually, the distribution upgrade, I didn't quite catch that.

44:08.000 --> 44:10.000
For the test, you said you cloned it.

44:10.000 --> 44:15.000
So I guess, do you have compute and storage separate?

44:15.000 --> 44:22.000
So could you just clone it with the 22.04, or did you do a distribution upgrade?

44:22.000 --> 44:28.000
Or did you just provision empty VMs and then re-clone or re-sync everything?

44:28.000 --> 44:29.000
Okay.

44:29.000 --> 44:33.000
How do we get to the target cluster?

44:34.000 --> 44:39.000
Okay. We create new virtual machines with Ubuntu 22.04,

44:39.000 --> 44:50.000
and we take snapshots from production and create new machines from those snapshots.

44:50.000 --> 44:52.000
Storage snapshots, sorry.

44:52.000 --> 44:57.000
So we run on GCP and we take a storage snapshot and then create machines from the snapshots,

44:57.000 --> 45:02.000
which, with our current 48-terabyte disks, takes like half an hour,

45:02.000 --> 45:06.000
but it's way faster than restoring a pg_basebackup.

45:11.000 --> 45:13.000
I have two questions.

45:13.000 --> 45:14.000
Yeah.

45:14.000 --> 45:18.000
One is: how much effort did you put into building this automation?

45:21.000 --> 45:24.000
The first iteration of it was more than half a year,

45:24.000 --> 45:29.000
and now we have been working on it since 2022 or something.

45:29.000 --> 45:32.000
So we have iterated on it over the years.

45:32.000 --> 45:35.000
So it's a lot of work.

45:35.000 --> 45:41.000
Because if you get it wrong, there are certain error cases here that bring downtime,

45:41.000 --> 45:44.000
And there are error cases that would bring data corruption.

45:44.000 --> 45:47.000
So you have to test it really thoroughly.

45:47.000 --> 45:50.000
And if we do a test with our production data, it's a multi-day test.

45:50.000 --> 45:52.000
So yeah, a lot of effort went into that.

45:52.000 --> 45:56.000
And since it's years: how big of a team do you have?

45:56.000 --> 45:58.000
Currently, it's five people.

45:58.000 --> 46:02.000
And the majority of our work last year went into that.

46:02.000 --> 46:04.000
But we have some fluctuation in the team,

46:04.000 --> 46:06.000
so if it would have been a standing team,

46:06.000 --> 46:09.000
maybe we would have had a bit more headroom for other things.

46:09.000 --> 46:11.000
But it's multiple person-years,

46:11.000 --> 46:14.000
if you have a really large system like we have.

46:14.000 --> 46:19.000
And the second question is: would this be reusable outside?

46:19.000 --> 46:20.000
Yes.

46:20.000 --> 46:24.000
So I presented a very oversimplified architecture.

46:24.000 --> 46:29.000
And this architecture you can use for a lot of different use cases.

46:29.000 --> 46:32.000
Our automation is made for us.

46:32.000 --> 46:33.000
It's open source.

46:33.000 --> 46:34.000
You can clone the repo.

46:34.000 --> 46:35.000
You can look into it.

46:35.000 --> 46:40.000
But you most likely would not be able to use our Ansible playbooks one-to-one.

46:40.000 --> 46:43.000
But the concept for sure.

46:43.000 --> 46:47.000
But to repeat that, if you don't need zero downtime,

46:47.000 --> 46:49.000
if you can take an hour or so of downtime,

46:49.000 --> 46:52.000
I would go for one of the simpler approaches

46:52.000 --> 46:54.000
and optimize it to get minimal downtime.

46:54.000 --> 47:00.000
And only take on this endeavor if you have a hard requirement for, like, zero.

47:00.000 --> 47:01.000
Thank you.

47:01.000 --> 47:03.000
Here we go.

47:03.000 --> 47:06.000
I've got a question over here.

47:06.000 --> 47:10.000
All right.

47:10.000 --> 47:12.000
So before, you've mentioned that

47:12.000 --> 47:16.000
most of the services of GitLab run on Kubernetes,

47:16.000 --> 47:19.000
but Redis was left to the side.

47:19.000 --> 47:22.000
And I wanted to ask what exactly Redis

47:22.000 --> 47:23.000
is used for.

47:23.000 --> 47:27.000
Is it mostly for caching and job processing or something?

47:27.000 --> 47:29.000
What are the use cases for Redis?

47:29.000 --> 47:30.000
Okay.

47:30.000 --> 47:33.000
First, a disclaimer: my focus is PostgreSQL here.

47:33.000 --> 47:37.000
So to my understanding, we use Redis for application caching.

47:37.000 --> 47:39.000
So Ruby on Rails caches stuff in Redis,

47:39.000 --> 47:42.000
so it doesn't have to query the database for something;

47:42.000 --> 47:43.000
it just asks Redis.

47:43.000 --> 47:45.000
And for full disclosure,

47:45.000 --> 47:50.000
I didn't talk too much about Gitaly; Gitaly is our backend for actually storing

47:50.000 --> 47:51.000
Git data.

47:51.000 --> 47:54.000
That's also not in Kubernetes at the moment.

47:54.000 --> 47:55.000
It can run.

47:55.000 --> 48:00.000
We have some use cases for that, but for GitLab.com, it still runs on virtual machines as well.

48:00.000 --> 48:02.000
Okay, last question here.

48:02.000 --> 48:06.000
What is the procedure for a minor version upgrade?

48:06.000 --> 48:07.000
Is it the same?

48:07.000 --> 48:09.000
No, for minor version upgrades it's different:

48:09.000 --> 48:12.000
the data does not change.

48:12.000 --> 48:15.000
So, for minor version upgrades, we just create new standbys

48:15.000 --> 48:18.000
with the new minor version, put them into the load balancer,

48:18.000 --> 48:20.000
and then fade out the old ones.

48:20.000 --> 48:24.000
And when only the primary is left on the old version,

48:24.000 --> 48:27.000
we have to do a switchover, which, as we have seen,

48:27.000 --> 48:29.000
we can do without noticeable user impact,

48:29.000 --> 48:32.000
then we switch over and decommission the old primary.

48:32.000 --> 48:35.000
So, it's a bit tedious because we have to create all these new nodes,

48:35.000 --> 48:38.000
put them in the load balancer, put the old ones out,

48:38.000 --> 48:42.000
but it's something we do just during normal working hours, with only

48:42.000 --> 48:46.000
a fraction of this preparation effort here.

48:50.000 --> 48:53.000
We've got one more question here, time for it.

48:53.000 --> 48:56.000
Hi, thanks for the talk.

48:56.000 --> 48:58.000
A small question.

48:58.000 --> 49:02.000
I assume that GCP has one or another kind of managed Postgres.

49:02.000 --> 49:05.000
What were your considerations for not using that?

49:05.000 --> 49:07.000
Is it okay?

49:07.000 --> 49:11.000
Okay, the question is about GCP's own offering for managed Postgres.

49:11.000 --> 49:14.000
Okay, I hope I don't step on somebody's toes here,

49:14.000 --> 49:18.000
but the GCP offering is like an 80% offering,

49:18.000 --> 49:21.000
made to catch a lot of people who want to have

49:21.000 --> 49:23.000
just a PostgreSQL instance,

49:23.000 --> 49:27.000
but for our scale, it's unfortunately not possible.

49:27.000 --> 49:30.000
I guess that's a good sum-up.

49:30.000 --> 49:32.000
Thank you.

49:32.000 --> 49:33.000
Thank you.

49:33.000 --> 49:36.000
Thank you.

