WEBVTT

00:00.000 --> 00:17.440
Thanks for the introduction. So for those of you who have never heard about Sculpt OS, let me very

00:17.440 --> 00:23.680
briefly summarize it in a few words. So Sculpt OS is the showcase of the Genode OS Framework

00:23.680 --> 00:32.160
as a desktop OS, and the Genode OS Framework is characterized by being a component-based operating

00:32.160 --> 00:39.840
system framework which follows a hierarchical distribution of resources, and it also supports

00:39.840 --> 00:46.720
several microkernels. And as a side note, what I also want to add, because we had this in a

00:46.720 --> 00:53.200
previous talk: the framework heavily uses asynchronous communication between components

00:53.680 --> 01:03.280
and tries to avoid synchronous IPC wherever possible. So the official Sculpt image that we

01:03.280 --> 01:11.760
publish twice a year, that for x86 uses the NOVA microhypervisor; we also had a talk about it today.

01:11.760 --> 01:20.960
It's an older version of the NOVA microhypervisor, which has been modified a bit, and for

01:20.960 --> 01:30.400
ARM platforms we use a custom kernel, which we also use as a vehicle for test-driving new ideas.

01:31.120 --> 01:34.880
And this is also what this talk will be about: it's about a custom kernel.

01:37.360 --> 01:43.440
But first, since the talk is about scheduling, let's go a little bit into the user perspective:

01:43.440 --> 01:49.760
what happens if you install a component? On Sculpt OS, the user is faced with the decision

01:50.160 --> 01:58.320
in which scheduling group to run a component. There are four scheduling groups: driver for latency-

01:58.320 --> 02:07.040
sensitive components, multimedia for audio, video and stuff like that, default for basic

02:07.040 --> 02:13.920
desktop applications that just want balanced computing, and background, of course, for background

02:13.920 --> 02:21.680
best-effort tasks. And the scheduling groups are then mapped to scheduling priorities on the

02:21.680 --> 02:29.520
underlying kernel. And since most of the supported microkernels use fixed-priority-

02:29.520 --> 02:35.280
based scheduling, we have the issue that, for instance, a broken device driver (we port

02:36.240 --> 02:43.200
device drivers from the Linux kernel) can lead to starvation of the

02:43.200 --> 02:51.520
lower priorities. So when designing and implementing the Genode custom kernel

02:52.960 --> 03:00.880
around 2014, the idea came up to add quotas to priority-based scheduling.

03:00.960 --> 03:06.880
So by setting a quota for a component: when the component exceeds its quota,

03:08.240 --> 03:17.040
its scheduling is degraded, so basically it is then scheduled in the background in a round-

03:17.040 --> 03:24.160
robin manner. This effectively mitigates the broken-device-driver scenario that I mentioned

03:24.240 --> 03:33.840
before, and it also nicely enables the hierarchical distribution of quotas. But unfortunately, the

03:34.640 --> 03:42.000
downside is that setting these quotas requires tuning for the workload that you're running and

03:42.000 --> 03:47.840
also for the platform you're running on. So finding values that work for every workload and

03:47.840 --> 03:56.320
every platform is, yeah, basically not possible. It always requires tuning. So for a desktop

03:56.320 --> 04:03.920
OS, on which you want to run everything, this turned out to be not the best solution.

04:05.840 --> 04:11.360
So let's take a step back. What do we actually want to have? First of all, we want to have

04:11.360 --> 04:18.080
fair scheduling; on a desktop OS that's very important. If every component gets a fair

04:18.080 --> 04:27.200
share of the CPU, you prevent starvation. Also, some threads need to get low latency,

04:28.000 --> 04:32.240
while other threads don't care much about latency, they just want to do some computing.

04:32.560 --> 04:41.680
What's also important to us is that it's easy for the user to configure. So we

04:41.680 --> 04:48.720
don't want to confront the user with a lot of decisions on how to set scheduling parameters and so on.

04:50.240 --> 04:56.400
And lastly, it should also be robust against misconfiguration. So if you don't get it right,

04:57.360 --> 05:07.840
it should degrade gracefully and not be broken entirely. So actually, two years ago,

05:07.840 --> 05:18.320
here at FOSDEM, Stefan and I had the idea to revise the scheduling of the custom kernel, and,

05:18.320 --> 05:28.480
yeah, how did we do that? So first of all, when you set out to do this, you do some research:

05:28.480 --> 05:35.360
you review what's currently the newest, what's in the literature,

05:35.360 --> 05:39.920
and what the basic things are. So let's start with a very simple one. I think everyone should

05:39.920 --> 05:46.720
know round-robin scheduling. Just to summarize: the threads (every component or task)

05:47.360 --> 05:52.160
take turns, each thread gets an equal share, and you have a fixed time slice length. That's pretty

05:52.160 --> 05:58.960
easy, that's very fair. But if one thread should get a little bit more than the others,

06:00.080 --> 06:09.200
then you need to add weights. So if you add weights, then each thread gets a

06:09.200 --> 06:15.920
share of the CPU that's proportional to its weight. As soon as you do this, however, you

06:16.640 --> 06:22.560
can come up with different schemes of interleaving. So in this example here, we have

06:23.120 --> 06:28.400
three threads A, B, and C; A has a weight of four, B and C both have a weight of two.

06:29.840 --> 06:36.480
Then we can have the interleaving scheme shown here to the right: we run A, B, C,

06:37.440 --> 06:43.040
and then two times A. This would be one possible super-schedule, if you will.

06:44.640 --> 06:51.440
The downside here is, as you see, that the distance between two executions of B differs. So B actually gets

06:52.000 --> 07:00.160
some sort of varying latency. You can do this more uniformly, for instance with this schedule,

07:00.320 --> 07:08.000
where the distance between two executions of B is always equal. How this is typically implemented

07:08.000 --> 07:14.160
is as a round-based implementation, which has the downside that when you add threads, which you do

07:14.160 --> 07:23.440
when you are running dynamic workloads, then the number of rounds

07:23.440 --> 07:29.440
that you need for the schedule might change, and then it's not trivial to transition between

07:29.520 --> 07:37.840
these scenarios. So what we found is what I call here virtual-time scheduling, which is a nice trick

07:38.640 --> 07:44.240
to come up with a similar schedule as we saw before with weighted round-robin, but one that is much easier to

07:44.240 --> 07:50.960
implement. So each thread has a weight and a virtual time, and the scheduler simply picks

07:52.080 --> 07:56.720
the thread which has the minimum virtual time and runs it on the CPU.

07:57.680 --> 08:05.680
And then, when it ran the thread, the virtual time of the thread is increased by the real time

08:05.680 --> 08:14.800
that it ran divided by the weight of the thread. A simple idea. So let's look at an example.

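The selection rule just described can be sketched in a few lines (an illustrative simulation with made-up thread names and the constant four-unit time slice from the talk's example, not the kernel's actual code):

```python
# Illustrative simulation of the virtual-time scheduling rule from the talk:
# always run the thread with the minimum virtual time, then charge it its
# real run time divided by its weight. Thread names are made up.

def pick_next(threads):
    """Pick the thread with the minimum virtual time."""
    return min(threads, key=lambda t: t["vtime"])

def charge(thread, real_time):
    """Virtual time grows by real time divided by weight."""
    thread["vtime"] += real_time / thread["weight"]

threads = [
    {"name": "A", "weight": 4, "vtime": 0.0},
    {"name": "B", "weight": 2, "vtime": 0.0},
    {"name": "C", "weight": 2, "vtime": 0.0},
]

schedule = []
for _ in range(8):          # simulate eight dispatches
    t = pick_next(threads)
    charge(t, 4)            # constant time slice of four real-time units
    schedule.append(t["name"])

print("".join(schedule))
```

A's weight is half of the total weight (4 of 8), and it indeed receives half of the eight dispatches; ties are broken here by list order, which stands in for the arbitrary choice mentioned in the talk.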
08:15.760 --> 08:22.160
Let's say we run each thread for four units of real time here. So all threads start

08:22.160 --> 08:30.720
with the same virtual time, zero. We first run A for four real-time units, and because

08:30.720 --> 08:37.760
thread A has a weight of four, its virtual time is only increased by one. Then we take B;

08:39.120 --> 08:46.240
its virtual time is increased by two, similar for C. And now A has the minimum virtual time

08:46.240 --> 08:53.200
again, so we pick it, and let's say we run it again for four real-time units, and after that

08:53.200 --> 08:58.720
we switch to B. This is kind of arbitrary in this example, because we could also choose

08:58.720 --> 09:09.280
A again; let's say we choose B, and then C again. And now, when A's turn comes up again,

09:09.280 --> 09:13.280
we run it twice, because after running it for four real-time units, it still has the minimum

09:13.440 --> 09:21.200
virtual time and is selected again. So, as you see, we come up with a similar schedule

09:21.200 --> 09:30.400
as we saw before with weighted round-robin. So when putting this idea into practice,

09:30.400 --> 09:36.880
of course there are more detailed questions that we have to answer. One of them is how to assign

09:36.880 --> 09:43.200
the weights to the threads. I mentioned before that, as with the assignment of quotas, somehow you

09:43.200 --> 09:49.520
need to come up with sensible values. You also have to determine the length of the time slice

09:51.200 --> 09:58.080
for which each thread is scheduled. Also, what do we do when threads are sleeping? And,

09:58.080 --> 10:04.080
actually, the last and most important point here, which was our initial requirement: how can we tune

10:04.080 --> 10:11.040
this to achieve better, lower latency for some threads and maybe higher latency for others?

10:13.920 --> 10:21.760
There we found two particularly interesting existing solutions. One is EEVDF, which is

10:21.760 --> 10:28.960
the new replacement for the Linux Completely Fair Scheduler, and we found BVT, which stands for

10:28.960 --> 10:36.400
borrowed virtual time, and this is actually what we ended up using when implementing and designing

10:36.400 --> 10:44.960
our new scheduler. The approach that we took here was not to jump in right away and implement it,

10:44.960 --> 10:50.400
but of course we started with some sort of experimentation: simulate some scenarios, play with

10:50.480 --> 10:59.040
parameters, just to understand what this scheduling scheme does, and to double-check whether

10:59.040 --> 11:10.080
this is really what we want to have. So let's start with the weight assignment. The

11:10.080 --> 11:16.000
proportional share that a thread gets in the system is the weight of the thread divided by the sum of the

11:16.080 --> 11:23.920
weights of all active threads. So when we add threads, the proportional

11:23.920 --> 11:34.560
share of a thread actually changes. For instance, if we add more threads to the default scheduling group,

11:34.560 --> 11:41.280
the share of the driver scheduling group would shrink. And this is not what we want. We could,

11:41.360 --> 11:48.880
of course, come up with some reassignment scheme to reassign the weights, but this would

11:49.680 --> 11:53.920
unnecessarily complicate the implementation here. So what we ended up doing is

11:54.880 --> 12:04.560
employing a hierarchical scheduling scheme where we assign a fixed proportional share to the groups,

12:04.560 --> 12:11.200
the four groups that I mentioned in the beginning, and within the groups we decided to use equal

12:11.200 --> 12:16.880
weights for every thread. So within a group, every thread gets the same share,

12:18.320 --> 12:25.360
and this nicely relieves the user from any decision apart from which scheduling group to run a component in.

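The fixed-share idea can be sketched as follows. The default and background weights (5 and 1) are the values quoted later in the talk, while the driver and multimedia weights here are made-up placeholders:

```python
# Sketch of the hierarchical share computation: each scheduling group gets a
# fixed weight, and the threads within a group share the group's portion
# equally. Driver/multimedia weights below are illustrative placeholders;
# default=5 and background=1 are the values quoted later in the talk.

GROUP_WEIGHT = {"driver": 8, "multimedia": 6, "default": 5, "background": 1}

def thread_share(group, active):
    """CPU share of one thread in `group`; `active` maps group -> thread count."""
    total = sum(GROUP_WEIGHT[g] for g, n in active.items() if n > 0)
    return GROUP_WEIGHT[group] / total / active[group]

# Adding threads to the default group does not shrink the driver group's share:
print(thread_share("driver", {"driver": 1, "default": 2, "background": 1}))
print(thread_share("driver", {"driver": 1, "default": 8, "background": 1}))
```

Both calls print the same share: growing the default group only dilutes the threads inside that group, which is exactly the robustness property motivated above.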
12:28.880 --> 12:34.320
About the time slice length: when you think about it, you can come up with a couple of

12:34.320 --> 12:40.000
different options here. One, as shown in the example before, is to just pick a constant time slice

12:40.960 --> 12:46.800
length. We could also make it thread-specific, or somehow derive it from the weight.

12:48.720 --> 12:57.840
One interesting aspect that comes from the BVT scheme here is the idea of eliminating context

12:57.840 --> 13:03.920
switches to the same thread. So in the example that we had before, where A runs two times in a row,

13:03.920 --> 13:10.320
we could instead just run it once for twice as long.

13:13.120 --> 13:20.000
And how this is done in BVT is that, when we pick a thread to run, we let it catch up in

13:20.000 --> 13:27.200
virtual time until it has caught up with the minimum virtual time of the others,

13:28.160 --> 13:35.040
and then execute it for a constant real time longer. So let's show an example again.

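That catch-up rule might look like this in code (a simplified, illustrative sketch; the constant run-ahead of four units is just the value used in the talk's examples):

```python
# Sketch of a BVT-style dispatch with catch-up (simplified, illustrative):
# the picked thread first catches up, in virtual time, to the minimum
# virtual time of the other ready threads, then runs ahead of it for a
# constant real time.

AHEAD = 4  # constant real-time run-ahead, as in the talk's examples

def dispatch(threads):
    """Pick the min-virtual-time thread; return its name and real run time."""
    t = min(threads, key=lambda th: th["vtime"])
    others_min = min(th["vtime"] for th in threads if th is not t)
    # real time needed to catch up in virtual time, plus the constant slice
    catch_up = max(0.0, (others_min - t["vtime"]) * t["weight"])
    run_time = catch_up + AHEAD
    t["vtime"] += run_time / t["weight"]
    return t["name"], run_time

threads = [
    {"name": "A", "weight": 4, "vtime": 0.0},
    {"name": "B", "weight": 2, "vtime": 0.0},
    {"name": "C", "weight": 2, "vtime": 0.0},
]
for _ in range(5):
    print(dispatch(threads))
```

This reproduces the walk-through that follows: A and B get plain four-unit slices at first, while C's slice already grows to six units (two units to catch up with A's virtual time of one, plus four on top).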
13:36.960 --> 13:48.400
We have the same weights as before. We pick A first; it runs for four real-time units, and after that

13:48.560 --> 13:58.880
we pick B. And now B first catches up with the... or, yeah, it cannot catch up actually, because

13:58.880 --> 14:04.640
C still has the minimum virtual time. So it's the same scenario for B here: it runs for four real-

14:04.640 --> 14:12.720
time units. And then for C, we have the situation that now the minimum virtual time is defined by

14:12.960 --> 14:20.800
A. So C runs until it reaches the virtual time of one, and then we add four real-time units.

14:24.080 --> 14:36.560
After that, a similar thing happens for A, and then B can run again. So what we end up with here,

14:36.560 --> 14:45.200
what we see here in the schedule, is that in the beginning, when all threads start with

14:45.200 --> 14:53.600
the same virtual time, the time slices are a bit smaller, but over some iterations, they get

14:53.760 --> 15:06.400
some bigger chunks of time to execute. So far, in the examples, we assumed that

15:07.280 --> 15:13.440
all threads just want to do computing, but in reality, of course, threads sleep: they become

15:13.440 --> 15:21.440
unready and get ready again, signalled by something. So let's look at some examples here. Let's say

15:22.480 --> 15:29.840
A becomes unready, B is scheduled, and A becomes ready again while B is executing. In this scenario, A

15:29.840 --> 15:38.880
would need to wait until its virtual time is the minimum virtual time in the system before it can

15:38.960 --> 15:47.200
run again. So there's some latency, some waiting, involved here. Another situation would be

15:48.320 --> 15:56.160
if A becomes unready and gets ready again only after the point where its virtual time is the minimum in the system.

15:59.840 --> 16:07.760
In this case, we can raise the question of how long to execute A for. Here it's important that

16:07.760 --> 16:13.680
the sleeping time is not taken into account, because otherwise, if A would sleep for a couple of minutes,

16:13.680 --> 16:18.240
it could afterwards also run for a couple of minutes exclusively, and that's not what we want, right? So

16:18.960 --> 16:26.720
the approach that's taken here by BVT is that we just

16:28.160 --> 16:34.880
increase the virtual time of A to the current minimum when A becomes ready, and effectively

16:35.040 --> 16:40.320
the proportional share is then maintained among all active threads.

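That wake-up rule is tiny in code; a hedged sketch (thread names and numbers invented for illustration):

```python
# Sketch of the BVT wake-up rule: a waking thread's virtual time is bumped
# up to the current minimum among the ready threads, so time spent sleeping
# does not accumulate into a large catch-up bonus. Values are illustrative.

def wake(thread, ready_threads):
    """Make `thread` ready; forget virtual time 'saved up' while sleeping."""
    if ready_threads:
        min_vtime = min(t["vtime"] for t in ready_threads)
        thread["vtime"] = max(thread["vtime"], min_vtime)
    ready_threads.append(thread)

ready = [{"name": "B", "vtime": 12.0}, {"name": "C", "vtime": 14.0}]
a = {"name": "A", "vtime": 3.0}      # slept for a long time
wake(a, ready)
print(a["vtime"])                    # bumped to 12.0, not kept at 3.0
```

Because the bump uses `max`, a thread that wakes up with a virtual time already above the minimum keeps it: the rule only removes the sleeping bonus, it never penalizes.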
16:41.760 --> 16:46.560
What you also see here in the examples, as I already mentioned, is that there's some latency.

16:46.560 --> 16:53.280
In the left case, A has to wait until it has the minimum virtual time, and in the right case,

16:54.560 --> 17:03.520
if A needs a bit more execution time to finish but only gets a smaller time slice when it becomes

17:03.520 --> 17:11.760
ready, then it has to wait for the executions of B and C here. So how can we tune this to achieve

17:11.760 --> 17:19.120
lower latency? Maybe we could just move some of the later time slices of A

17:20.160 --> 17:29.200
more to the front. And the nice trick in BVT, how this is achieved, is by assigning a warp value

17:29.200 --> 17:37.920
to each thread; with this warp value, the effective virtual time of a thread is reduced.

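The warp mechanism can be sketched as a bias on the scheduling decision (the warp values and thread names here are invented for illustration, not Sculpt's configured ones):

```python
# Sketch of BVT warping: the scheduling decision uses an *effective*
# virtual time, which is the actual virtual time minus the thread's warp
# value. Warp values and names below are made-up examples.

def effective_vtime(thread):
    return thread["vtime"] - thread["warp"]

def pick_next(threads):
    return min(threads, key=effective_vtime)

threads = [
    {"name": "latency_sensitive", "vtime": 10.0, "warp": 5.0},
    {"name": "batch",             "vtime": 8.0,  "warp": 0.0},
]
# Without warp, "batch" (vtime 8) would win; warping prefers the other thread.
print(pick_next(threads)["name"])
```

Only the comparison is warped; the actual virtual-time accounting stays unchanged, which is why the warped thread is both picked earlier and, as described next, allowed a longer slice before the others catch up.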
17:38.720 --> 17:46.880
So as a result, the thread is, on the one hand, preferred in the scheduling decision, and when it's

17:46.880 --> 17:55.280
scheduled, it also increases the length of its time slice. And in Sculpt, we actually apply this

17:55.360 --> 18:02.400
idea at the group level, so each group gets a warp value, and we then derive the warped time

18:04.000 --> 18:11.520
from the thread within the group with the lowest virtual time. And there's also a time limit for

18:11.520 --> 18:20.720
warping, so that, for instance, a driver thread in the broken-driver scenario does not consume

18:20.800 --> 18:25.680
all the warping benefit of the entire group, because there are other threads that, when they get activated,

18:25.680 --> 18:33.040
want to get this warping benefit. So there's a time limit: if a thread runs warped for too long,

18:33.040 --> 18:42.400
then its warping status is disabled. And with that, I want to hand over to Stefan,

18:43.040 --> 19:00.560
who has some more practical details for you. Thank you. Thank you for waiting.

19:02.800 --> 19:10.160
So in the few remaining minutes, I would like to talk about practical considerations:

19:10.160 --> 19:18.560
what do you need to consider when using this new concept, when using our custom kernel? And the

19:18.560 --> 19:27.840
short answer is: not much, because if you are using Sculpt OS already, then you actually do not

19:27.840 --> 19:37.200
have to change anything in it. So there are no new configuration options or means that bother

19:37.200 --> 19:44.480
the user to choose between those warp and weight values for the scheduling groups and

19:45.120 --> 19:52.800
to decide which scheduling groups to use. In contrast to this, we use predefined scheduling groups;

19:52.800 --> 19:58.960
unsurprisingly, those four groups are already used on a much higher level in the Sculpt configuration.

19:58.960 --> 20:06.640
So if you configure Sculpt and spawn new components, you have to choose for those components

20:06.720 --> 20:14.160
in which of those scheduling groups they should reside, already today. It's of course a little bit different

20:14.160 --> 20:21.520
if you are using the Genode OS Framework without Sculpt; if you tailor your own scenario,

20:21.520 --> 20:30.560
then you should keep in mind how the hierarchical priorities are mapped to those scheduling groups.

20:30.640 --> 20:35.040
And this is what I want to show you next. So in Sculpt, we actually have two

20:36.960 --> 20:43.680
priority levels that are important. The very first one is spawned by the top-level init,

20:44.320 --> 20:52.880
which creates the static portion of the system, and it has, of course, a privileged priority level

20:53.840 --> 21:00.560
that it starts with, and it defines another,

21:00.560 --> 21:10.080
minor priority level where most of the components reside. And so the very first ones,

21:10.080 --> 21:16.960
which are actually the top-level init itself and the timer driver: they get the driver scheduling group,

21:17.200 --> 21:24.400
or rather, they reside in the scheduling group of the drivers, and the rest in the multimedia group.

21:25.040 --> 21:31.520
And one of the important components over here is the runtime init process, which spawns all the dynamic

21:33.040 --> 21:40.480
stuff that is going on in the system. So let's have a look at this. Surprisingly, this does not

21:40.560 --> 21:51.200
only have four priority levels; the runtime init process itself already gains the multimedia

21:52.960 --> 22:01.280
scheduling group, and we define lower priority levels, from driver down to background, underneath,

22:01.280 --> 22:08.720
to stay compliant with the other kernels, which use a linear order of priorities,

22:08.800 --> 22:17.920
and to fit this scheme. Because the number of priority levels needs to be a power of two, we have

22:17.920 --> 22:25.200
three segments which are actually unused in Sculpt. So what you should

22:25.200 --> 22:30.800
take from this slide: if you are not using Sculpt, then you should keep in mind that whenever you

22:31.200 --> 22:39.520
deviate from this scheme, you might get unexpected results when your components reside in anything

22:39.520 --> 22:45.200
in between, because those unused segments actually land in the background group again.

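The fallback just mentioned can be illustrated with a toy mapping. The concrete priority-level numbers below are hypothetical, not the kernel's real layout; only the fallback behaviour (unused levels land in the background group) mirrors what the talk describes:

```python
# Toy illustration: mapping linear kernel priority levels to the four
# scheduling groups. The level numbers are hypothetical placeholders;
# the point is only that levels outside the segments actually used by
# Sculpt fall back to the background group.

USED_SEGMENTS = {7: "driver", 5: "multimedia", 3: "default", 1: "background"}

def group_for_priority(level):
    return USED_SEGMENTS.get(level, "background")   # unused levels -> background

print(group_for_priority(7))  # a used level: driver
print(group_for_priority(6))  # an unused in-between level: background
```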
22:50.000 --> 22:56.560
Let's showcase something with it. I mean, it won't be as fancy as the demonstration before,

22:56.560 --> 23:09.440
but we will do our best. When preparing this talk, we tried to showcase some of the

23:09.440 --> 23:17.360
latency advantages, and what we found out is that it's pretty hard to demonstrate this stuff. So we concentrate

23:17.440 --> 23:26.560
on fairness. What you see here is a simple graphics demonstration example which is

23:27.280 --> 23:36.320
pre-configured to stay in the background scheduling group. It doesn't use the GPU, it just

23:36.320 --> 23:43.440
uses the CPU for this animation, and it takes as much of the CPU as it can,

23:44.160 --> 23:54.160
which you can see in this simple top component over here. So it's nearly eating up the whole CPU

23:54.800 --> 24:04.320
and, yeah, renders as many frames as it can. And if I start another one in another group, in the default

24:04.640 --> 24:13.360
group, then you will see: okay, it's getting slower, the other one is still smooth, and we have like

24:14.480 --> 24:22.080
five times the CPU usage on this newly spawned component, which is due to the fact

24:23.040 --> 24:33.760
that the weight of the default group is five, while, if you compare it to the first one, its weight is one.

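The numbers seen in the demo follow directly from the group weights stated in the talk (default 5, background 1); a quick arithmetic check, with the share computation written out:

```python
# The demo's CPU-usage numbers follow from the group weights quoted in the
# talk: default has weight 5, background weight 1. A group's portion is its
# weight over the total of the active groups, split equally inside the group.

def shares(threads_per_group, weights):
    total = sum(w for g, w in weights.items() if threads_per_group.get(g, 0) > 0)
    return {g: weights[g] / total / n
            for g, n in threads_per_group.items() if n > 0}

weights = {"default": 5, "background": 1}

one_each = shares({"default": 1, "background": 1}, weights)
print(one_each)     # the default component gets five times the background one

two_default = shares({"default": 2, "background": 1}, weights)
print(two_default)  # background share unchanged; the default components split theirs
```

This matches both observations from the demo: the 5:1 usage ratio when one component runs in each group, and the unaffected background share once a second default component is added.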
24:33.840 --> 24:44.240
So this is in line with the weight values of both groups. And if there's time left, I'll start another one,

24:45.200 --> 24:54.800
but I have to configure it, I think. So we just start another one of these components,

24:55.760 --> 25:06.720
again not using the GPU but the CPU, and I will put it on the same CPU, number six.

25:08.320 --> 25:18.960
And that's it; so now we have two components running in this default group, and as you can see

25:19.040 --> 25:28.480
here in the load, the background group is not affected: it still gets its CPU share; the others

25:28.480 --> 25:40.880
have to split theirs between them. So, quite simple. Okay, so that's the end of the talk. If you want to

25:41.520 --> 25:48.320
try this out: everything we have shown here, live on the Sculpt OS system using our own

25:48.320 --> 25:55.520
kernel, can be used by just downloading this image. And, yeah, now we are open for questions. Thank you for your

25:56.400 --> 25:57.520
attention.

26:07.200 --> 26:12.240
Would it make sense to implement some kind of mechanism that would allow the user space,

26:12.240 --> 26:21.280
some init or some root task, to basically upload a scheduling algorithm into the kernel,

26:21.680 --> 26:32.160
like what the recent Linux kernel can do with eBPF? Yeah...

26:33.200 --> 26:43.280
Well, I have to repeat the question: whether it would be fine to upload some

26:44.000 --> 26:54.160
scheduling algorithms into the kernel. So, well, I cannot answer this, actually.

26:54.160 --> 27:00.320
From my intuition, I would say the mechanism would get too complicated, and I do not want to

27:00.880 --> 27:08.160
have it as complicated as this. What we actually think about, or what we might think about for the

27:08.240 --> 27:14.640
future, is to gain some configuration option that is compliant with the rest but gives you more

27:16.400 --> 27:21.280
freedom to just configure different scheduling groups according to your needs.

27:22.960 --> 27:29.040
And maybe, to add to the question: so maybe, if you already have some sort of hierarchical

27:29.120 --> 27:38.320
scheduling, maybe by using different parameters in different contexts, you might

27:38.960 --> 27:42.960
be able to adapt to different workloads dynamically. Right, maybe, yes.

27:50.160 --> 27:56.560
How elaborate were these guesstimates? Like, how much work did you do on them?

27:59.680 --> 28:10.400
So the question was how elaborate the estimates were. So basically, it's the weights

28:10.400 --> 28:19.920
and the warp values; we can go to the slide. So the weights come from experience that we had before,

28:19.920 --> 28:25.920
when we used the quota-based scheduling, where we noted that, well, there is a lot of

28:25.920 --> 28:32.480
driver work, and it's important work; if the drivers are not fast enough, things crash. So having a

28:32.480 --> 28:41.120
large weight for the drivers, a reserve for that, is a good option. But the

28:41.120 --> 28:50.240
concrete numbers are relatively arbitrary, right? For the warp values, it's, so, maybe, yeah, a good

28:50.320 --> 28:57.680
guesstimate. So what you basically get with this setting is that, for instance, the default group gets

28:57.680 --> 29:05.920
a benefit of 10 milliseconds over the background group. So when the group gets ready, it can execute

29:05.920 --> 29:13.120
for 10 milliseconds without any interruption by the background scheduling. And, well, if you look at the

29:13.280 --> 29:19.920
time slice configuration also that we have: so we have this constant fixed length

29:19.920 --> 29:31.120
that a thread runs ahead of another one, which is 5 milliseconds in our configuration, so the average

29:31.120 --> 29:36.720
time slice length is around the order of magnitude of 10 milliseconds. So this is the idea that we put

29:36.800 --> 29:46.480
into it: that the default group here gets the benefit of an uninterrupted full-length time slice, right?

29:47.040 --> 29:54.240
And you can, yeah, go on with this: multimedia then also gets a 10-millisecond benefit,

29:54.240 --> 30:01.600
uninterrupted, over default, and 20 milliseconds uninterrupted over background, and so on. So this is

30:01.600 --> 30:23.920
more the line of thinking behind that, yeah. So the question is, in a

30:23.920 --> 30:32.000
multicore setting, whether the limits and the settings are different or the same, right,

30:32.800 --> 30:41.520
for example. So, of course, those parameters are the same. The virtual time, as you have seen:

30:42.080 --> 30:48.720
if a thread is blocking, or if it would be transferred to another CPU, for instance, it would get the

30:49.200 --> 30:56.400
actual last virtual time of that scheduler. So the scheduler knows what the recent,

30:57.760 --> 31:04.080
lowest virtual time is, and we would gain it from that one. So, no, a thread which

31:04.080 --> 31:10.320
becomes blocked doesn't take away its virtual time.

31:18.720 --> 31:24.240
If you finish the execution of the driver, then probably it's done for that one window, and when

31:24.240 --> 31:32.800
you finish the processing, you have the boost and you lose the benefit again. And here,

31:32.800 --> 31:37.680
what I mean is: if you want, for example, in a single group, some of the drivers in the end

31:37.680 --> 31:43.520
should start working and be rewarded for a short time. Do you have something like that?

31:44.480 --> 31:55.280
It's hard to repeat the question... I have a rough idea, so I'll just try to answer. So

31:56.560 --> 32:01.920
the question is whether we have some priority-boosting things like that. So basically,

32:01.920 --> 32:07.040
this warping is kind of the idea of priority boosting for a certain time, right?

32:07.520 --> 32:14.000
We don't have a separate boosting scheme, but we do have helping in the system: if you do some

32:14.000 --> 32:24.560
synchronous IPC, then the called thread is running on your time slice. But, yeah, for

32:24.560 --> 32:29.840
asynchronous signalling, we don't have a thing like that, but you could imagine implementing

32:29.840 --> 32:33.840
something like that, yeah.


