WEBVTT

00:00.000 --> 00:15.600
Let's continue. Up next, we have Himadri Chhaya-Shailesh, I hope that's reasonably close.

00:15.600 --> 00:21.360
Who has recently completed a PhD and has been looking into the

00:21.360 --> 00:28.440
fun topic of parallelization in the context of virtualized environments, a high-level description

00:28.440 --> 00:35.440
I guess. In particular, here in the GCC OpenMP implementation and

00:35.440 --> 00:41.440
libgomp, the corresponding runtime library. All right.

00:50.440 --> 00:51.440
Can you hear me?

00:51.440 --> 00:52.440
Yes.

00:52.440 --> 01:00.240
Okay. Perfect. So, hello. I'm Himadri. And I recently defended my PhD. My supervisors are

01:00.240 --> 01:05.440
Jean-Pierre Lozi and Julia Lawall. By the way, Julia is present in the room. And

01:05.440 --> 01:10.120
the previous speaker has put me in a challenging spot to show as much excitement about

01:10.120 --> 01:15.440
this work in front of my supervisors. Thank you for that.

01:15.440 --> 01:24.440
Right. So, I right now work with the Krakow team at Inria as a postdoc. So, this is

01:24.440 --> 01:28.440
something that I did with the Whisper team, at Inria Paris.

01:28.440 --> 01:35.440
Okay. So, first I'm going to give a little bit of context of this work. Then a quick

01:35.440 --> 01:43.440
background to introduce the thesis. And finally, I will go over the contributions

01:43.440 --> 01:49.440
of the thesis. And we will focus on the parts that touch libgomp, which is the runtime library

01:49.440 --> 01:59.440
provided by GCC. Okay. So, the context of this work is at the intersection of parallelization,

01:59.440 --> 02:05.440
scheduling and virtualization. And to be more specific, I am addressing parallelization,

02:05.440 --> 02:12.440
achieved through libgomp, the OpenMP implementation of GCC; scheduling in terms of the Linux

02:12.440 --> 02:19.440
kernel. So, right now, we have the EEVDF scheduler. And for virtualization, I am using the QEMU/

02:19.440 --> 02:29.440
KVM hypervisor stack. Okay. Maybe most of you would be familiar with what OpenMP is,

02:29.440 --> 02:37.440
but in case someone needs the definition. It's basically an API that lets you conveniently

02:37.440 --> 02:46.440
convert your code to run using parallelization. So, it provides you compiler

02:46.440 --> 02:52.440
directives, library routines, and environment variables. And you can do a lot of

02:52.440 --> 02:57.440
stuff with it, and the compiler and the library do a lot of heavy lifting for you. It

02:57.440 --> 03:04.440
also gives the programmer a lot of control over how to achieve the parallelization.

03:05.440 --> 03:12.440
Okay. The very basic idea is: suppose you have a loop, and if you are going to do the work

03:12.440 --> 03:19.440
sequentially, then you are going to do the same work in every iteration. And then if you are

03:19.440 --> 03:26.440
using OpenMP, then you will write something like #pragma omp parallel for, and then that is going

03:26.440 --> 03:32.440
to help you parallelize the loop. So, the application programmer just needs to write the

03:32.440 --> 03:39.440
pragma directive; behind the scenes, OpenMP does a lot of heavy lifting. It starts

03:39.440 --> 03:44.440
with deciding how many threads to use, then you have to fork these threads, and then you have to

03:44.440 --> 03:50.440
distribute the work to these threads, then you have to start the work, then you have to

03:50.440 --> 03:56.440
synchronize the end of the work. And once you are done, you need to terminate the threads.

03:56.440 --> 04:04.440
So, this is done by the library. And as I mentioned, you create threads. And for

04:04.440 --> 04:10.440
this, OpenMP uses the fork-join model. So, there is a master thread, and that takes

04:10.440 --> 04:17.440
care of creating a team of workers, distributing the work, and synchronizing the workers.

04:17.440 --> 04:24.440
The master also makes a decision about how many threads to use. And once that decision

04:24.440 --> 04:32.440
is made, you have to create those threads, and then synchronize and terminate them. You

04:32.440 --> 04:38.440
also, if your parallel application has multiple parallel regions, then you can also

04:38.440 --> 04:43.440
reuse some existing threads. So, you don't necessarily create new threads every time.

04:43.440 --> 04:50.440
The thread pool stays around, and most of the time we are just reusing the threads.

04:50.440 --> 04:56.440
And for synchronization, OpenMP heavily relies on barriers. In simple terms, a

04:56.440 --> 05:01.440
barrier is a point in execution, where all threads must arrive before anybody can proceed

05:01.440 --> 05:11.440
further. And for various reasons, to give just two examples, all these threads do not

05:11.440 --> 05:16.440
reach the barrier at the same time. Sometimes the work distribution is not perfectly

05:16.440 --> 05:22.440
even, and that can slow some threads down. If you are running your application on a DVFS system,

05:22.440 --> 05:26.440
then your different threads are running at different frequencies. So, your speed of

05:26.440 --> 05:32.440
finishing computation is different on different cores, and that could also slow some threads

05:32.440 --> 05:39.440
down. So, there are some early arrivals at the barrier, and then there are some late

05:39.440 --> 05:48.440
arrivals. So, how do we wait for the latecomers? And we have basically two choices.

05:48.440 --> 05:54.440
We can either spin or we can block, and then there are some heuristic-based decisions

05:54.440 --> 06:00.440
currently in libgomp, and then there are also some environment variables that let you

06:00.440 --> 06:08.440
change this behavior. But I would like to point out that there are two barriers that we
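
NOTE The environment variables in question are, in libgomp, OMP_WAIT_POLICY and
GOMP_SPINCOUNT; a quick sketch of how they are typically set (the application
itself is omitted here):
```shell
# Ask waiting threads to block (sleep) instead of spinning:
export OMP_WAIT_POLICY=passive
# Or keep spinning, but cap the busy-wait iterations before blocking:
export GOMP_SPINCOUNT=10000
```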

06:08.440 --> 06:14.440
care about, or that the libgomp implementation cares about. So, one kind of barrier is called the

06:14.440 --> 06:20.440
threads_dock barrier. That synchronizes the start of any new parallel work. And then there is

06:20.440 --> 06:28.440
another kind of barrier, the team barrier, which synchronizes work within the team. So, you would have

06:28.440 --> 06:33.440
a parallel region, and within the region you could be doing different chunks of parallel

06:33.440 --> 06:38.440
work, and you might need to share interim results, et cetera. So, the barriers that get used

06:38.440 --> 06:44.440
are different, one that marks the regions, and another one that marks the chunks of the

06:44.440 --> 06:52.440
work that is being done. Okay. So, here is a somewhat realistic example. Again, it is

06:52.440 --> 06:59.440
taken from a microbenchmark suite, and I am showing five parallel regions,

06:59.440 --> 07:05.440
and inside each of these regions, there are two loops that we are trying to parallelize.

07:05.440 --> 07:12.440
So, the first #pragma omp parallel decides how many parallel regions are going to

07:12.440 --> 07:18.440
be there in this application, and then #pragma omp for marks a loop: okay,

07:18.440 --> 07:25.440
this loop is inside a parallel region, and it should be executed with multiple threads.

07:25.440 --> 07:32.440
So, what does it look like? And I would like to note that I could produce all these

07:32.440 --> 07:37.440
visualizations thanks to the scheduler tracing tools that were presented at FOSDEM 2023,

07:37.440 --> 07:45.440
in the kernel devroom. And they work on top of the ftrace infrastructure in the

07:45.440 --> 07:49.440
Linux kernel. So, you can generate a scheduler trace that is going to give you a lot of

07:49.440 --> 07:54.440
events that happen in terms of scheduling, and then you can visualize it in terms of like

07:54.440 --> 07:59.440
which thread was running on which core, which thread blocked. A lot of things about scheduling

07:59.440 --> 08:06.440
you can learn from this tool: what's going on when we are running an OpenMP application.

08:06.440 --> 08:13.440
And then we are talking about threads. So, the impact of scheduling also comes into the

08:13.440 --> 08:18.440
picture: if you want good performance, you want good decisions by the scheduler.

08:18.440 --> 08:25.440
So, very quickly, scheduling 101. Inside the Linux kernel, we have a task_struct that contains

08:25.440 --> 08:32.440
important information related to a task. Threads are also treated as tasks; task is a

08:32.440 --> 08:37.440
generic term inside the kernel. From the kernel's view, everything is just a task.

08:37.440 --> 08:42.440
So, there is no process/thread differentiation that we generally read about in the

08:42.440 --> 08:47.440
textbooks. The CPUs are represented with another kind of data structure called

08:47.440 --> 08:54.440
runqueues. And two terms that I will be using very frequently are preemption and context switch.

08:54.440 --> 09:00.440
So, just to define them: you stop a task from running on a CPU. That is a preemption.

09:00.440 --> 09:04.440
And you switch between two tasks. That's the context switch.

09:04.440 --> 09:09.440
And we are talking about running OpenMP workloads inside virtual machines.

09:09.440 --> 09:13.440
So, then you have dual-level task scheduling. So, your application is forking

09:13.440 --> 09:19.440
some threads. These threads are trying to run on vCPUs. Then there is a guest scheduler

09:19.440 --> 09:24.440
that is placing your threads on vCPUs. Then you have a host scheduler that is

09:24.440 --> 09:31.440
placing vCPUs on pCPUs, and then the actual execution happens.

09:31.440 --> 09:38.440
And we care about what happens to the vCPUs because we are running our threads on them.

09:38.440 --> 09:45.440
And the thing that troubles us most is that from the host scheduler's point of view,

09:45.440 --> 09:50.440
vCPUs are not special citizens. They can be preempted anytime.

09:50.440 --> 09:56.440
And they can be preempted at an inconvenient time. And that could have performance impacts on

09:56.440 --> 10:02.440
OpenMP applications. So, we care about these preemptions.

10:02.440 --> 10:08.440
Okay. So, now we have enough background: we know the context and we know all the

10:08.440 --> 10:14.440
important terms. So, I can then quickly touch on the motivation for this thesis before

10:14.440 --> 10:21.440
presenting the thesis. Running OpenMP applications in the cloud is commonplace.

10:21.440 --> 10:28.440
Oversubscription is also a cost-cutting technique that is widely used.

10:28.440 --> 10:33.440
And if you want good performance for your OpenMP applications,

10:33.440 --> 10:38.440
you want good scheduling decisions and you want good barrier synchronization performance.

10:38.440 --> 10:44.440
And the current implementation of libgomp is virtualization-oblivious.

10:44.440 --> 10:49.440
And most of these choices that impact the scheduling performance

10:49.440 --> 10:55.440
or the barrier synchronization performance are static.

10:55.440 --> 11:02.440
So, here is where we define what the thesis is.

11:02.440 --> 11:07.440
It is that we want to combine the scheduling insights at both levels.

11:07.440 --> 11:13.440
And then we want to use these insights to guide the decisions made by libgomp.

11:14.440 --> 11:17.440
And we are particularly targeting two decisions.

11:17.440 --> 11:22.440
How many threads to use in each parallel region, and how to synchronize these threads.

11:22.440 --> 11:28.440
And the goal is to improve the performance of OpenMP applications,

11:28.440 --> 11:33.440
running inside oversubscribed virtual machines.

11:33.440 --> 11:39.440
So, the work could be broadly categorized into three major contributions.

11:39.440 --> 11:42.440
The first is the Phantom Tracker.

11:42.440 --> 11:47.440
It is an algorithm that tracks what is happening to your vCPUs.

11:47.440 --> 11:52.440
It is mostly implemented inside the Linux scheduler currently.

11:52.440 --> 11:56.440
It would be much nicer to provide the same algorithm as a set of BPF programs.

11:56.440 --> 11:59.440
And that is something I would like to do.

11:59.440 --> 12:02.440
And I have talked about this earlier at kernel conferences.

12:02.440 --> 12:07.440
So, this information already exists, and my thesis will also be available soon.

12:07.440 --> 12:12.440
So, if someone wants to look at the details of this part of the contribution,

12:12.440 --> 12:13.440
that would be available.

12:13.440 --> 12:16.440
I wouldn't be going into detail of it right now.

12:16.440 --> 12:19.440
The second one is the barrier synchronization mechanism.

12:19.440 --> 12:23.440
We call it PV barrier sync because we are introducing this paravirtualization. A runnable

12:23.440 --> 12:28.440
vCPU is something that is not a phantom vCPU.

12:28.440 --> 12:29.440
Okay.

12:29.440 --> 12:31.440
So, we track phantoms.

12:31.440 --> 12:33.440
We enable communication between the schedulers.

12:33.440 --> 12:36.440
We register our tasks with the guest.

12:36.440 --> 12:39.440
We register the VM's vCPUs with the host.

12:39.440 --> 12:43.440
And finally, we compute the phantom average.

12:43.440 --> 12:47.440
We compute this metric at the granularity of the scheduler tick.

12:47.440 --> 12:54.440
And that tells us the impact of host overload conditions on the execution of our application.

12:54.440 --> 13:02.440
So, we care about the overload impact on us and not the global load average.

13:02.440 --> 13:06.440
And that's why we don't just take what libgomp does now.

13:06.440 --> 13:11.440
Because there is an OMP_DYNAMIC implementation that takes the global load average into account,

13:11.440 --> 13:15.440
but that is not sufficient.

13:15.440 --> 13:18.440
Okay.

13:18.440 --> 13:22.440
The second contribution is motivated by two important academic works.

13:22.440 --> 13:28.440
The first work shows that spinning in an overloaded scenario is very bad for performance.

13:29.440 --> 13:31.440
And blocking is not great either.

13:31.440 --> 13:37.440
The second work shows that if you are synchronizing threads in an oversubscribed scenario,

13:37.440 --> 13:40.440
then you should be removing the oversubscription.

13:40.440 --> 13:45.440
So, spin as many threads as there are CPUs and then block the remaining ones.

13:45.440 --> 13:47.440
So, we built on top of it.

13:47.440 --> 13:50.440
We introduce paravirtualized insights.

13:50.440 --> 13:52.440
So, we know how many vCPUs can run.

13:52.440 --> 13:55.440
So, we only want to run as many threads.

13:55.440 --> 13:59.440
Sorry, we only want to use spinning for that many threads and block the remaining ones.

13:59.440 --> 14:06.440
So, essentially, we are making a decision on a per-thread basis at each barrier: should I block at this barrier?

14:06.440 --> 14:10.440
And that is entirely determined by the runtime conditions.

14:10.440 --> 14:17.440
You don't need to read this, but it's a slide for reference if someone wants to look up the slides later.

14:17.440 --> 14:23.440
But this is where we change the do_spin behavior in libgomp.

14:24.440 --> 14:28.440
Okay, and the final contribution, which works in three parts.

14:28.440 --> 14:34.440
So, first, we adapt the DOP, the degree of parallelism, based on the information that we are getting from the Phantom Tracker.

14:34.440 --> 14:40.440
In addition to the phantom average, we also provide an idle average.

14:40.440 --> 14:48.440
And then we are using the PV barrier sync mechanism for the team barrier as well as for the threads_dock barrier.

14:48.440 --> 14:55.440
And finally, we have a lightweight task affinity mechanism that we use to guide the guest scheduler.

14:55.440 --> 14:57.440
Okay.

14:57.440 --> 15:01.440
It is more complicated than this, but this is the gist.

15:01.440 --> 15:04.440
If you have phantoms, don't use more threads.

15:04.440 --> 15:07.440
Instead, reduce your DOP.

15:07.440 --> 15:14.440
If you see idle cores, that means you can increase your DOP.

15:14.440 --> 15:20.440
And if you see that okay, we are in a stable situation, don't change anything.

15:20.440 --> 15:28.440
We want to be more conservative when we increase the DOP so that we don't create more phantoms and cause instability.

15:28.440 --> 15:32.440
So, there is like a stability requirement.

15:32.440 --> 15:43.440
We only scale up by the minimum number of idle pCPUs seen in two ticks, and then we also scale that window.

15:43.440 --> 15:51.440
And for affinity, we start by putting our application threads on the first N vCPUs, if the DOP is N.

15:51.440 --> 16:02.440
And then later, if we need to shrink the CPU set, we do it once, to better guide the PV barrier sync mechanism.

16:02.440 --> 16:08.440
So, there are some scheduling-related nitty-gritty details about why we do this.

16:08.440 --> 16:20.440
And there is a lot of fighting that I had to do with the Linux load balancer, but that is an area, you know, that you can improve on.

16:20.440 --> 16:23.440
And then quantify your performance on it.

16:23.440 --> 16:35.440
Then we also want to test whether we do well in varying load conditions, as well as whether we do well if we scale from a smaller virtual machine to a bigger one.

16:35.440 --> 16:42.440
Okay, for evaluation, I am using the NAS Parallel Benchmarks, the OpenMP implementation of them.

16:42.440 --> 16:44.440
And as you can see, it is quite diverse.

16:44.440 --> 16:54.440
Like some applications have lots of decision points to change your DOP, some have lots of barriers, and some have a good mix of both things.

16:54.440 --> 16:59.440
And for the competing workload, I am using a random sequence.

16:59.440 --> 17:02.440
Again, this graph is generated using the same scheduler tracing tools.

17:02.440 --> 17:06.440
The green line shows how many threads are running in the competing VM.

17:06.440 --> 17:13.440
And then we also change how frequently the number of threads changes in the competing VM.

17:13.440 --> 17:19.440
So, for the load conditions: in the first one, the load is changing once every second.

17:19.440 --> 17:27.440
In another one, I have a step size of 100 milliseconds, so every 100 milliseconds we have a different load condition to balance against.

17:27.440 --> 17:32.440
For the baseline: of course, not using spinning is common sense.

17:32.440 --> 17:37.440
So we are testing against blocking, like block all threads at all barriers.

17:37.440 --> 17:44.440
And then we are using a fixed number of threads for each parallel region.

17:44.440 --> 17:52.440
And then for our algorithm, I have patches in the host scheduler, the guest scheduler, and libgomp.

17:52.440 --> 17:59.440
All right, so on the right, it's a larger machine; it has 96 vCPUs.

17:59.440 --> 18:13.440
And we managed a win: I think we are going from 30 seconds, or maybe 25 seconds, down to 10 seconds on that machine.

18:13.440 --> 18:19.440
And on the smaller machine, also we are going from 35 seconds to maybe 15 seconds.

18:19.440 --> 18:22.440
So we win a lot on this benchmark.

18:22.440 --> 18:29.440
But the important thing to note is that there are like lots of parallel regions and lots of barriers.

18:29.440 --> 18:34.440
So there is good opportunity for us to apply corrections.

18:34.440 --> 18:44.440
And then we have this one where we don't exactly win, because there is very little opportunity for changing the number of threads and there are too many barriers,

18:44.440 --> 18:50.440
which our current PV barrier sync struggles to do well with.

18:50.440 --> 18:55.440
Okay, so I want to test this more.

18:55.440 --> 18:59.440
I want to fix it in the scenarios where it doesn't work.

18:59.440 --> 19:03.440
And I would like to port it to BPF; it is not with BPF as of now.

19:04.440 --> 19:09.440
Will you use sched_ext when it comes to be there?

19:09.440 --> 19:14.440
And the next question is, will you be making a BPF backend for sched_ext?

19:14.440 --> 19:16.440
Okay, please use sched_ext.

19:16.440 --> 19:21.440
Okay, I think I can safely say we will try to.

19:21.440 --> 19:25.440
Okay.

19:26.440 --> 19:28.440
I'm not sure I got it.

19:28.440 --> 19:35.440
How do you figure out that the number of runnable vCPUs is changing?

19:35.440 --> 19:45.440
Right, so the question is, how do we figure out that the number of runnable vCPUs is changing. And that is by

19:45.440 --> 19:53.440
observing the decisions of the host scheduler for the vCPUs, and then also enabling hints from the guest scheduler telling that an

19:53.440 --> 19:59.440
OpenMP thread is running on this vCPU. And then I'm monitoring in the host scheduler what is happening to those vCPUs.

19:59.440 --> 20:04.440
So essentially I'm monitoring the context switch and wakeup events.

20:04.440 --> 20:11.440
And I'm measuring when a vCPU was thrown off a pCPU and when it got back on again.

20:11.440 --> 20:19.440
So those are state transitions; I take samples and I compute the average from the transitions.

20:23.440 --> 20:27.440
A simple follow-up, in case I don't know or don't understand:

20:27.440 --> 20:32.440
let's say you're in the middle of one of the scheduling decisions

20:32.440 --> 20:33.440
in your library.

20:33.440 --> 20:38.440
What happens if you get descheduled at that moment?

20:38.440 --> 20:40.440
So there are two scenarios.

20:40.440 --> 20:45.440
One is that you are

20:45.440 --> 20:50.440
a worker that is trying to do the work, or you are a worker who is waiting for the others to arrive at the barrier.

20:50.440 --> 20:52.440
And in both cases.

20:52.440 --> 21:00.440
There is the standard vCPU preemption path where the context gets saved.

21:00.440 --> 21:04.440
You will start from that context when you are scheduled back again.

21:04.440 --> 21:06.440
So that is already happening.

21:06.440 --> 21:10.440
And it does not change anything about the approach.

21:10.440 --> 21:12.440
So preemption.

21:12.440 --> 21:13.440
Yes.

21:13.440 --> 21:16.440
Preemptions are an inefficiency for the workers, in other words; that is true.

21:16.440 --> 21:17.440
Okay.

21:17.440 --> 21:18.440
Time is up.

21:18.440 --> 21:19.440
Thank you.

21:19.440 --> 21:22.440
Thank you.

