WEBVTT

00:00.000 --> 00:13.440
So, our next talk is by an MM person, usually the people that make our lives difficult.

00:13.440 --> 00:14.440
I'm joking.

00:14.440 --> 00:18.840
But he's going to talk about new updates in the slab world.

00:18.840 --> 00:23.600
Yes, so hello, my name is Vlastimil.

00:24.160 --> 00:31.360
I work at SUSE in the kernel core team, for, I think, close to 12 years now.

00:31.360 --> 00:37.080
And at some point, I became a maintainer of the slab allocator.

00:37.080 --> 00:40.720
And also later, co-maintainer of the page allocator.

00:40.720 --> 00:47.520
And one of the first things I did after starting to maintain slab is, you might remember,

00:47.520 --> 00:53.880
there were three of those allocators initially (SLAB, SLOB and SLUB) to select from at config time.

00:53.880 --> 01:02.000
And I removed two of them, and so now there's only the one called SLUB.

01:02.000 --> 01:11.320
So, as I was looking at what to do next, I had an idea, because I was seeing some

01:11.320 --> 01:17.560
subsystems developing their own caches on top of slab and not allocating directly, for various

01:17.560 --> 01:21.000
reasons, and I wanted to look into this.

01:21.000 --> 01:25.840
Some of those, like the IO layer, do it for performance reasons, because they think

01:25.840 --> 01:33.480
their own arrays are faster than slab. BPF had another reason: it wanted to allocate

01:33.480 --> 01:41.280
objects in contexts that aren't supported by slab, such as an NMI handler.

01:41.320 --> 01:52.000
The maple tree is a new thing in MM for VMA management, and its problem is that it can start

01:52.000 --> 01:58.800
some operation to rewrite the tree, and now it needs to allocate objects, and it cannot

01:58.800 --> 02:07.000
fail during that phase and also cannot free memory. So what it did is pre-allocate for the

02:07.000 --> 02:16.400
worst case and then free the unused objects back, which is not very efficient.

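The workaround just described can be sketched as a small userspace C model. This is illustrative only, not kernel code; the names `prealloc_fill`, `prealloc_take`, `prealloc_drain` and `WORST_CASE` are invented for the sketch: reserve for the worst case up front, consume only what the operation needs, then hand back the surplus one object at a time.

```c
#include <stdlib.h>

#define WORST_CASE 16

/* Invented illustration, not kernel code: a reserve filled before a
 * critical section that must not fail or free memory. */
struct prealloc {
    void *objs[WORST_CASE];
    int count;
};

/* Phase 1: runs before the critical section, so failure is still allowed. */
static int prealloc_fill(struct prealloc *pa)
{
    for (pa->count = 0; pa->count < WORST_CASE; pa->count++) {
        pa->objs[pa->count] = malloc(64);
        if (!pa->objs[pa->count])
            return -1;
    }
    return 0;
}

/* Phase 2: inside the critical section, taking an object cannot fail. */
static void *prealloc_take(struct prealloc *pa)
{
    return pa->count > 0 ? pa->objs[--pa->count] : NULL;
}

/* Phase 3: give back everything unused, one object at a time -- the
 * per-object overhead the talk calls "not very efficient". */
static int prealloc_drain(struct prealloc *pa)
{
    int freed = 0;
    while (pa->count > 0) {
        free(pa->objs[--pa->count]);
        freed++;
    }
    return freed;
}
```

If the worst case is 16 objects but a typical operation only touches a handful, most of the fill and drain work is wasted, which is the inefficiency motivating what follows.
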
02:16.400 --> 02:25.160
So I wanted to do something with the slab allocator to support these use cases so that people

02:25.160 --> 02:30.440
don't have to write workarounds around it. And this is also something that wouldn't

02:30.440 --> 02:35.600
have been possible if we still had three allocators, because these improvements would have to be

02:35.600 --> 02:43.320
done in all of them, so settling on just a single allocator helped in this case. And hopefully,

02:43.320 --> 02:52.960
by doing that, we would also improve the performance of the slab allocator by combining

02:52.960 --> 03:01.040
the best of the designs of the remaining allocator and the SLAB one that was deleted.

03:01.040 --> 03:06.200
So I need to spend some time explaining how they work.

03:06.200 --> 03:13.920
So, a slab allocator is allocating objects that are smaller than a page, so it

03:13.920 --> 03:21.560
starts with a page and slices it up into small objects of the same size, and some

03:21.560 --> 03:26.800
of them would be allocated or free at any given moment and there's some management structure

03:26.800 --> 03:34.960
that remembers which of those objects are free and how many of them are there and how

03:34.960 --> 03:36.680
many are used.

03:36.680 --> 03:43.240
There's already a difference between how SLAB worked and how SLUB works.

03:43.240 --> 03:51.400
SLUB has this embedded free list in the objects that are currently free which makes it

03:51.400 --> 04:01.720
possible to do some atomic operations like allocating or freeing one or multiple objects

04:01.720 --> 04:06.040
at once in a single atomic operation.

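The embedded free list can be illustrated with a small userspace C model (not the kernel code; `struct cache` and the function names are invented for the sketch). Each free object stores the pointer to the next free object inside itself, so the list costs no extra memory, and pushing or popping the list head is a single compare-exchange. The real SLUB additionally pairs the pointer with a transaction counter to guard against the ABA problem, which this sketch ignores.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy model: while an object is free, its first bytes hold the pointer
 * to the next free object, forming the embedded free list. */
struct object {
    struct object *next_free;   /* valid only while the object is free */
    char payload[56];
};

struct cache {
    _Atomic(struct object *) freelist;
};

/* Allocate: pop the list head with a compare-exchange loop.
 * (Real SLUB also checks a transaction id to avoid ABA.) */
static struct object *cache_alloc(struct cache *c)
{
    struct object *head = atomic_load(&c->freelist);
    while (head != NULL &&
           !atomic_compare_exchange_weak(&c->freelist, &head,
                                         head->next_free))
        ;   /* a failed CAS reloads head; just retry */
    return head;
}

/* Free: link the object back in as the new list head. */
static void cache_free(struct cache *c, struct object *obj)
{
    struct object *head = atomic_load(&c->freelist);
    do {
        obj->next_free = head;
    } while (!atomic_compare_exchange_weak(&c->freelist, &head, obj));
}
```

Note that freeing an object makes it the next one to be allocated, i.e. the list behaves LIFO, which matters for cache-hotness later in the talk.
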
04:06.040 --> 04:12.720
But you need more than one slab because there are many objects, so they are usually linked

04:12.720 --> 04:14.720
on some per-NUMA-node structure.

04:15.720 --> 04:21.840
There might be multiple lists, because we distinguish slabs that are free, that are partially

04:21.840 --> 04:30.240
full, or fully full, and there will be some spinlock to protect this structure, this

04:30.240 --> 04:36.880
list of slabs, from concurrent modification. And if you only had this, it wouldn't really

04:36.880 --> 04:42.720
scale well, because every operation would take the spinlock and multiple CPUs would

04:42.720 --> 04:43.720
contend on it.

04:43.720 --> 04:50.560
So, the allocator always has some kind of per-CPU caching layer, and I should note that this

04:50.560 --> 04:57.600
is much simpler to implement for allocating than for freeing for different reasons.

04:57.600 --> 05:04.600
The old SLAB had basically an array of objects that were allocated from the slabs but were put

05:04.680 --> 05:13.320
into this array and not yet given out to the consumers. And when this array got full, it

05:13.320 --> 05:21.360
was flushed; when it was empty, it would be refilled from the slabs. And this

05:21.360 --> 05:29.560
amortizes the operations with the spinlock: you allocate or free multiple objects at once,

05:29.800 --> 05:33.280
and it improves performance.

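The batching effect can be sketched in userspace C. This is an illustrative model only, not the kernel code; `struct cpu_cache`, `struct shared_pool`, `BATCH` and the counters are invented names. The fast path is a plain array pop, and the shared lock is taken once per batch rather than once per object.

```c
#include <pthread.h>

#define BATCH 8

/* Toy model of a per-CPU array cache in front of a lock-protected
 * shared pool: only an empty array takes the shared lock, refilling a
 * whole batch, so the lock cost is amortized over BATCH allocations. */
struct shared_pool {
    pthread_mutex_t lock;
    int remaining;      /* stand-in for objects available in shared slabs */
    int lock_takes;     /* counts how often the slow path ran */
};

struct cpu_cache {
    int objs[BATCH];
    int count;
};

static int cache_alloc(struct cpu_cache *c, struct shared_pool *p)
{
    if (c->count == 0) {                    /* slow path: batched refill */
        pthread_mutex_lock(&p->lock);
        p->lock_takes++;
        while (c->count < BATCH && p->remaining > 0)
            c->objs[c->count++] = p->remaining--;
        pthread_mutex_unlock(&p->lock);
        if (c->count == 0)
            return -1;                      /* shared pool exhausted */
    }
    return c->objs[--c->count];             /* fast path: local LIFO pop */
}
```

Allocating 64 objects through this model takes the shared lock only 8 times; the same amortization argument applies on the freeing side when a full array is flushed in one batch.
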
05:33.280 --> 05:40.080
The newer SLUB has a different kind of scheme, where each CPU has a completely

05:40.080 --> 05:45.360
private slab that it can operate on.

05:45.360 --> 05:51.600
It turns out that it wasn't enough to have just one, so later there was a list of more

05:51.680 --> 06:01.280
of them that it could use when the main one got depleted. And this allows some nice fast

06:01.280 --> 06:09.680
paths, again doing some compare-exchange operation that doesn't even have to be atomic

06:09.680 --> 06:19.680
because it's on the local CPU only. But for freeing, it only works if you're

06:19.760 --> 06:26.400
freeing an object to the same slab that's currently installed on that CPU as the per-CPU

06:26.400 --> 06:33.440
cached slab, which it turns out often isn't the case, because you might have allocated the object

06:33.440 --> 06:41.600
some time ago or migrated to a different CPU. So the fast path was annotated as likely, but

06:41.600 --> 06:46.640
it was actually a lie, because it wasn't all that likely to happen: there are many

06:46.720 --> 06:53.920
slabs, and the chance that this particular slab is on the same CPU is quite low. And the

06:53.920 --> 07:02.800
slow path is this more expensive atomic operation; it's not bad, but it's not as good as

07:02.800 --> 07:14.400
the fast path. So the idea of sheaves is to combine basically these two approaches: we would

07:14.400 --> 07:24.960
again have a fixed-capacity array of objects for caching them on a given CPU, and also some

07:24.960 --> 07:36.160
per-NUMA-node barn that can cache these sheaves between different CPUs. And yeah, I have

07:36.160 --> 07:42.320
to thank Matthew Wilcox for inventing this terminology, so we don't have to use the magazine

07:42.880 --> 07:54.320
word with its military connotation. And yeah, so this avoids the issues of the old SLAB design

07:55.120 --> 08:10.240
by using the current SLUB slow path for freeing remote objects, which it turns out

08:10.240 --> 08:17.840
happens only in maybe 5% of the cases. But we have improved the fast path and made it more likely

08:19.040 --> 08:27.600
by just putting the object into a per-CPU array and not relying on the same slab being on the same

08:27.600 --> 08:35.360
CPU. There's also a usage of a new locking primitive, local_trylock_t, that makes the

08:35.440 --> 08:48.480
locking of these per-CPU sheaves less expensive than the usual IRQ-safe local_lock, so that's

08:48.480 --> 08:58.240
again an improvement. And for the maple tree use case, which was one of the motivations, we also have

08:58.240 --> 09:08.160
an API where the maple tree can ask for a sheaf of at least a given number of objects, because it

09:08.160 --> 09:16.960
can calculate this worst-case estimate before starting the critical operation. And in the ideal

09:16.960 --> 09:23.280
case, we just give it the sheaf from the CPU. It will only use as many objects as it needs

09:23.280 --> 09:29.840
to do the critical operation, those allocations cannot fail, and then ideally it returns the

09:29.840 --> 09:37.360
sheaf back; we just put it back on the CPU and it has only a few fewer objects than before. So this

09:37.360 --> 09:45.520
is much cheaper than pre-allocating and then returning a bunch of objects. We can also optimize

09:45.520 --> 09:58.000
kfree_rcu, which is freeing objects after an RCU grace period, by creating another per-CPU sheaf that collects

09:58.880 --> 10:07.520
the objects that were passed to kfree_rcu. And only once it's full, we use the RCU machinery to process

10:07.600 --> 10:15.280
the whole sheaf and then reuse it again, without dealing with the individual objects as

10:16.160 --> 10:23.760
was done before. So that should also improve the performance. So that's roughly the design of the

10:23.760 --> 10:33.040
sheaves in the slab allocator. Now, the status: in 6.18 the implementation was merged, and there were

10:33.120 --> 10:41.520
only two opt-in users that we looked at initially, the VMAs and the maple tree nodes, because they

10:41.520 --> 10:49.920
use the extra features. And luckily, the kernel didn't explode and nobody reported any crazy

10:49.920 --> 10:56.800
regressions, so that first step was okay. But it's not ideal to stop there, because

10:57.440 --> 11:02.880
we introduced a new caching layer, but it's opt-in and there's still the old caching layer as a fallback.

11:03.920 --> 11:10.880
So that's more code, and I like to have less code. And it's also unclear which caches

11:11.760 --> 11:20.800
would qualify to start using this; I guess everyone would want to, if it improves performance.

11:21.760 --> 11:27.760
So of course, the next step is to go all the way and enable this for all the caches and

11:27.760 --> 11:43.280
rip out the previous caching layer. So that's what the current series does. That's now in

11:43.360 --> 11:53.360
linux-next, and since the merge window starts soon, hopefully nothing will explode and it will go in. So the

11:54.080 --> 12:02.320
conversion wasn't anything special; there was just some fun with the bootstrapping

12:02.320 --> 12:08.640
and preventing recursion, because sheaves are allocated by kmalloc, which is

12:09.520 --> 12:18.400
slab allocation from specific caches for each object size, and we want the kmalloc caches to use

12:18.400 --> 12:25.840
sheaves, so there's a chicken-and-egg problem. But this is solvable, and similar

12:25.840 --> 12:36.320
problems have been there before and were solved. One question was how to select the sheaf size,

12:36.400 --> 12:45.280
because the larger it is, the better it amortizes the operations, but you potentially waste more

12:45.280 --> 12:51.840
memory if nobody allocates the objects that are sitting there. So initially, what I tried

12:51.840 --> 13:00.400
was to size them in a way that would be similar to how the per-CPU and per-CPU partial

13:00.400 --> 13:10.640
caches of slabs were used, so roughly the same number of objects would be cached as a result,

13:11.520 --> 13:20.240
and hopefully the automatic kernel bots would then not report any unexpected changes in

13:20.240 --> 13:27.440
performance just because we are caching a different amount of things, so it would look like the

13:27.440 --> 13:34.960
new code maybe has a problem when it's just a tuning issue. So I tried to be as close to the original

13:34.960 --> 13:44.320
as possible, but it's of course possible to tune it later once the thing is in. And then the fun

13:44.320 --> 13:55.840
part was removing lots of code of the old implementation, the per-CPU caches and partial slabs, though

13:57.600 --> 14:04.240
not all of that was removed, because the freeing slow path is still there and it's very useful.

14:04.960 --> 14:10.560
I also had a lot of fun with debugging a memory leak that came up in benchmarking,

14:11.600 --> 14:21.040
and it wasn't even found by Chris Mason's AI-powered review, which is otherwise quite useful, but

14:21.040 --> 14:27.600
it didn't find it. But after he fixed the prompt, once I reported it, he says now it should

14:27.600 --> 14:35.360
be found with high probability. Of course, the question is: what about performance? And I don't have any

14:35.360 --> 14:43.440
great answer in terms of graphs, because some of the bugs and leaks were fixed quite

14:43.600 --> 14:51.680
last minute and I couldn't get fresh numbers in time, so it would be a waste of time to present

14:51.680 --> 15:00.480
anything. But initially we found that, yes, the new fast paths are slightly faster than

15:00.480 --> 15:08.240
the old fast paths, because the locking is cheaper. And yeah, then it's about tuning and heuristics

15:08.400 --> 15:18.160
and whether any particular workload happens to benefit or not from what it's doing. So,

15:19.600 --> 15:29.520
overall, I would say that things were slightly faster, or the same, or in some workloads slightly slower.

15:30.800 --> 15:37.120
And there are some regressions reported by the kernel test robot that we were looking into, and

15:37.200 --> 15:46.560
it turns out that, yeah, it wasn't as bad as it initially looked, because it was testing some

15:46.560 --> 15:58.160
commits in isolation. But basically, the trade-offs here are: with the per-CPU slab, when we allocate

15:58.240 --> 16:08.800
from it, all the objects are from the same page, so it's better for TLB overhead, but it's

16:08.800 --> 16:14.960
worse for cache performance, because some of them might be cold. And with the per-CPU sheaves,

16:14.960 --> 16:22.720
which are the arrays where you allocate and free in LIFO order, the objects you most recently

16:22.800 --> 16:30.080
freed will be the ones that get allocated next. So if an object was cache-hot, it's likely it will still

16:30.080 --> 16:40.240
be cache-hot, so that's better for this. And yeah, I think the biggest benefit should be the

16:40.240 --> 16:48.640
freeing, which should now be hitting the fast path much, much more often than in the original SLUB.

16:48.800 --> 16:59.280
And there are also other reasons why we want to push this into the kernel sooner

16:59.280 --> 17:09.040
rather than later, because as I was saying, one of the motivations was BPF. But before, I envisioned that

17:09.040 --> 17:16.160
sheaves would be part of the solution for BPF so it can allocate in NMI context. Meanwhile, Alexei, the

17:16.240 --> 17:25.040
BPF maintainer, came up with kmalloc_nolock, which can do that without sheaves, and it was

17:25.040 --> 17:31.840
actually merged at about the same time as sheaves. But there are a number of limitations, like it cannot be used

17:31.840 --> 17:40.000
in preempt RT. And it turns out that when we replace the old per-CPU caching scheme with sheaves, it actually

17:40.720 --> 17:48.320
can remove those limitations, because the operations are conceptually simpler. So that's

17:49.360 --> 17:56.400
why I hope that it will be merged in the next version of Linux. So, thank you.

