WEBVTT

00:00.000 --> 00:19.080
Good evening everyone, my name is Yosh. I work at Percona as a quality engineer, and recently

00:19.080 --> 00:23.360
I have been playing around with GPUs, and this session is about what I learned:

00:23.360 --> 00:28.600
how you can partition a GPU using the MIG approach and what you can do with that.

00:28.600 --> 00:34.520
It will include setting things up and understanding how to monitor MIG as well.

00:34.520 --> 00:39.120
The overview of this session: we will see what sharing

00:39.120 --> 00:44.400
methods are available, then how MIG works and the MIG concepts you should

00:44.400 --> 00:50.000
be aware of if you want to partition a GPU, and we will also explore a workload.

00:50.000 --> 00:55.080
For video generation, we will see what happened with both the MPS and the MIG partitioning

00:55.080 --> 00:56.080
methods.

00:56.080 --> 01:01.560
Then we will explore ways in which you can monitor the MIG instances, and

01:01.560 --> 01:05.240
we will conclude with suggestions.

01:05.240 --> 01:11.400
So what are the ways in which you can share a GPU? There are broadly two types, temporal

01:11.400 --> 01:17.440
and spatial. You can share a GPU in time: a process takes turns with other processes

01:17.440 --> 01:22.080
and uses the full GPU during its turn; that is time sharing. Then there is MPS, where you have

01:22.160 --> 01:26.040
an entire GPU shared by the processes within it.

01:26.040 --> 01:31.400
So they all work at the same time and they share the GPU resources.

01:31.400 --> 01:36.280
It is a software-based approach. And then there is MIG, the thing we will be discussing

01:36.280 --> 01:37.280
today.

01:37.280 --> 01:40.200
MIG is an actual isolation of your GPU.

01:40.200 --> 01:45.360
If you have a large GPU, you can partition it in a strictly isolated manner and run

01:45.360 --> 01:46.720
processes on it.

01:46.720 --> 01:51.600
So it is a hardware-based GPU isolation. What do we mean by hardware based?

01:51.600 --> 01:58.200
The hardware boundaries are actually there, but you enable them using the MIG commands

01:58.200 --> 01:59.200
and all.

01:59.200 --> 02:05.400
That is what we mean by hardware based, but it is controlled using the CLI.
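
As a small aside, a minimal sketch of checking the current MIG mode from the CLI (this assumes an NVIDIA driver with a MIG-capable GPU such as an A100 or H100; the query fields are standard `nvidia-smi` fields):

```shell
# Check whether MIG mode is currently enabled on each GPU.
# On a machine without an NVIDIA driver this falls back to a message.
if command -v nvidia-smi >/dev/null 2>&1; then
    mig_modes=$(nvidia-smi --query-gpu=name,mig.mode.current --format=csv)
else
    mig_modes="nvidia-smi not found; run this on a host with an NVIDIA driver"
fi
echo "$mig_modes"
```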

02:05.400 --> 02:11.840
This is a broad comparison of how MIG differs from time slicing and MPS.

02:11.840 --> 02:18.720
With MIG, the GPU splits into isolated portions and there is no context-switch overhead, because in

02:18.720 --> 02:23.880
time slicing your process takes turns with other processes, so memory and compute

02:23.880 --> 02:24.880
are shared.

02:24.880 --> 02:31.400
There is a lot of switching between processes. MPS is a shared pool, and there too

02:31.400 --> 02:37.880
there is some context switching. But in MIG your process is isolated

02:37.880 --> 02:40.280
completely, so it does not need any sharing.

02:40.280 --> 02:44.440
So there is no context switch within that particular process.

02:44.440 --> 02:49.080
It is also a fixed resource setup, so you do not have to worry about timing and you do not

02:49.080 --> 02:54.080
have to worry about other processes hogging your resources.

02:54.080 --> 03:01.080
There are also limitations. In time slicing you effectively run a single workload at a time,

03:01.080 --> 03:04.040
so you cannot do much with that.

03:04.040 --> 03:08.960
In MPS you can switch quickly and many processes can use a single GPU, but

03:08.960 --> 03:13.600
in MIG you have at most 7 small slices, so if you have a workload you want to distribute,

03:13.600 --> 03:18.920
you can have a maximum of 7, and you can adjust the sizes based on the workload.

03:18.920 --> 03:24.200
For strict isolation requirements MIG is best, because it also guarantees quality of

03:24.200 --> 03:25.200
service.

03:25.200 --> 03:26.840
Now what do we mean by quality of service?

03:26.840 --> 03:32.920
Your bandwidth and your SM utilisation are all nearly guaranteed in a MIG partition.

03:32.920 --> 03:39.240
There are certain things which are not completely one hundred percent guaranteed, but

03:39.240 --> 03:43.240
most of the time it will work as expected.

03:43.240 --> 03:46.400
So what does MIG look like?

03:46.400 --> 03:52.080
On the left-hand side you can see a GPU partitioned into 7 GPU instances.

03:52.080 --> 03:58.480
MIG partitions it into 7 instances; this is an example of the smallest slice of a

03:58.560 --> 04:05.600
MIG instance. On the right-hand side there is a diagram of the NVIDIA A100 GPU.

04:05.600 --> 04:09.720
It has 8 slices of memory and 7 slices of compute.

04:09.720 --> 04:16.080
So a GPU instance is essentially divided

04:16.080 --> 04:20.160
into 2 parts: GPU slices and GPU engines.

04:20.160 --> 04:27.600
GPU slices consist of the memory slices and compute slices in the diagram above.

04:27.640 --> 04:32.720
GPU engines are a separate thing, allotted to the GPU instance based on the portion

04:32.720 --> 04:36.120
that you partition.

04:36.120 --> 04:42.200
You can see there are 8 memory slices and 7 compute slices, but they do not use

04:42.200 --> 04:48.640
exactly 8 or 7 equal divisions; it is almost that. We will see further how the slicing

04:48.640 --> 04:49.640
happens.

04:49.640 --> 04:57.360
The slice hierarchy is like this: you partition memory first, so you cut memory and

04:57.400 --> 05:01.880
then you assign compute to it. It is not the other way around:

05:01.880 --> 05:04.760
you cannot slice compute first and then assign memory.

05:04.760 --> 05:13.280
It is a two-level hierarchy, so you have to strictly follow those steps.

05:13.280 --> 05:15.880
So what are the partitions that we can do?

05:15.880 --> 05:22.840
On the left-hand side you can see the smallest partition: we sliced off 5GB of memory,

05:22.880 --> 05:29.480
which is the smallest available partition on the A100 GPU, and then we

05:29.480 --> 05:36.760
allotted 1 compute slice to it, so it is the smallest profile, 1g.5gb.

05:36.760 --> 05:42.840
There is also another one where you can share the memory pool with multiple compute instances.

05:42.840 --> 05:49.000
In the middle one, figure H, you can see a huge chunk, around 20GB,

05:49.080 --> 05:56.120
of memory allocated to 4 compute instances. Each has its own separate compute,

05:56.120 --> 06:02.440
but they share the memory pool. And lastly you can have

06:02.440 --> 06:08.120
a large GPU instance with large memory and large compute; it is an isolated

06:08.120 --> 06:13.960
instance but it is bigger compared to the smallest size which is figure G.

06:14.040 --> 06:19.720
So what can happen? With the smallest size you might have issues with your workload because

06:19.720 --> 06:24.200
it might require more resources, so it is not good for that; but for a smaller workload it works

06:24.200 --> 06:30.360
really well, because you can have 7 of those instances. On the right-hand side you have a big

06:30.360 --> 06:37.000
instance, but it may not be evenly utilised. What do we mean by not evenly utilised?

06:37.000 --> 06:41.800
If you have a workload which uses only 10GB of memory and two

06:42.600 --> 06:48.600
compute slices, the unused memory and compute slices are wasted on the side, so there is potential

06:48.600 --> 06:53.160
under-use of the compute. In the middle one, where multiple compute instances share

06:53.160 --> 06:59.160
a big memory chunk, you can have out-of-memory issues: all the compute instances compete for

06:59.160 --> 07:07.640
memory and eventually some processes crash from lack of memory. So what are the overheads

07:07.640 --> 07:12.680
in MIG? You are getting full isolation, so there must be something you are compromising.

07:12.680 --> 07:21.960
You compromise on exact compute divisions. As I said earlier, it is not an exact division by 7.

07:21.960 --> 07:29.000
You can see in the H100 example that the smallest slice, one compute slice, has around 16

07:29.000 --> 07:35.560
SMs, but the overall GPU has around 132 SMs. So it is not exactly divided by 7; you are missing

07:35.560 --> 07:43.400
around 20 SMs. You can also see in the middle one that around 60 SMs are

07:43.400 --> 07:50.360
available for the 3g and 4g profiles, which is not exactly half of 132, just half of 120, so you are

07:50.360 --> 07:56.920
leaving around 12 SMs. So there are a couple of things that you compromise if you are trying

07:56.920 --> 08:03.320
to use MIG; that is something you should keep in consideration. Then let us see how

08:03.400 --> 08:10.680
you can create a partition. We have access to a GPU and we list the available

08:10.680 --> 08:16.360
partitions using the command on the left-hand side, in figure L. It shows you the

08:16.360 --> 08:21.880
divisions that are available to you. Using the profile IDs you can create MIG profiles.

08:23.080 --> 08:27.480
You can see in the steps that first of all you need to enable MIG mode.

08:28.120 --> 08:32.760
GPUs do not have MIG enabled by default, so you have to enable it. You don't have to install

08:32.760 --> 08:38.600
any new utilities; for the newer GPUs they are installed by default. You just have to enable this mode.

08:38.600 --> 08:44.360
Once the mode is enabled, you first create a GPU instance on the particular GPU. You can see

08:44.360 --> 08:52.120
from figure K that we have GPU 0, and we use GPU 0 in step 2 to create two GPU

08:52.120 --> 08:59.960
instances with profile ID 9. So we got two 3g.40gb profiles in step 2 and created our

08:59.960 --> 09:06.600
GPU instances. Now we create and assign compute instances to them. Here

09:06.600 --> 09:13.240
we assign the entire compute capacity supported by those two GPU instances.

09:13.240 --> 09:18.440
That is the third step. But if you want to be even more specific and divide a

09:19.400 --> 09:25.320
GPU instance further, assigning multiple compute instances to it, you can do that with the small step

09:25.400 --> 09:31.160
at the bottom: you add the GPU instance ID and then assign the particular profiles you want.

09:32.440 --> 09:39.160
You can also list the compute instances and GPU instances using the commands shown.
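
The steps described above can be sketched as a command sequence (this is a sketch, not the exact slide: it assumes root, a MIG-capable GPU at index 0, and that profile ID 9 is the 3g.40gb profile as in figure K — profile IDs differ per GPU model, so check the `-lgip` listing first):

```shell
# Safe to run anywhere: skip if there is no NVIDIA driver on this host.
if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; skipping MIG setup sketch"; exit 0
fi
sudo nvidia-smi -i 0 -mig 1          # step 1: enable MIG mode on GPU 0
sudo nvidia-smi mig -lgip            # list the available GPU instance profiles
sudo nvidia-smi mig -i 0 -cgi 9,9    # step 2: create two GPU instances (profile ID 9)
sudo nvidia-smi mig -i 0 -cci        # step 3: create the default compute instance on each
sudo nvidia-smi mig -lgi             # list created GPU instances
sudo nvidia-smi mig -lci             # list created compute instances
```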

09:40.920 --> 09:47.160
Okay, so after creation, what does it actually look like? Using the nvidia-smi -L command you can see

09:47.160 --> 09:52.920
that there are multiple MIG instances available. You can see the UUIDs, which we will use

09:53.000 --> 09:59.720
to run our processes in the further steps. This is how it looks. In figure L you can also see

09:59.720 --> 10:04.680
that there are no free instances available; we have created the maximum profiles we possibly

10:04.680 --> 10:10.280
can. After creation this command helps you identify which slots are available and

10:10.280 --> 10:17.160
which are not available for partitioning. Okay so now let us look at the combinations.

10:17.720 --> 10:22.040
There are multiple combinations, and apart from the overhead we saw earlier,

10:22.120 --> 10:27.320
certain combinations of partitioning can waste compute and memory.

10:27.320 --> 10:31.880
We will look into that and also see which combinations use all the available profile

10:31.880 --> 10:35.640
slices, so your GPU is at least used completely, apart from the overhead.

10:38.120 --> 10:44.200
This is one example. You can see an H100 GPU with around 80 GB of memory.

10:44.200 --> 10:49.560
We divided it into two portions of three compute slices and 40 GB each.

10:50.440 --> 10:55.080
You can see there is one compute slice which is wasted, not allotted to anyone.

10:55.080 --> 11:00.040
That is one thing that can happen: if you partition in certain ways you miss

11:00.040 --> 11:06.600
out on compute capacity. It is around one seventh of your GPU,

11:06.600 --> 11:13.640
about 14 percent, that you are simply wasting; you are not using it anywhere. And

11:13.640 --> 11:19.400
for memory it is a similar thing: if you have seven identical instances with small compute

11:20.360 --> 11:26.280
slices, you have 10 GB of memory wasted. You are simply using seven eighths of your

11:26.280 --> 11:33.800
GPU's memory capacity. You can see from here, this is taken from the documentation.
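
The waste figures quoted above can be checked with quick shell arithmetic (the numbers are the ones quoted in the talk for an 80 GB H100 — 132 SMs total, roughly 16 SMs per smallest slice and 60 per half-GPU profile — not an official specification):

```shell
# SM overhead when slicing into 7 smallest instances.
total_sms=132
sms_per_1g=16
usable_sms=$((sms_per_1g * 7))        # SMs usable across 7 smallest slices
lost_sms=$((total_sms - usable_sms))  # SMs unusable under full 1g slicing
echo "1g x7: ${usable_sms} SMs usable, ${lost_sms} lost"

# SM overhead when splitting into two half-GPU instances.
sms_per_half=60
lost_half=$((total_sms - 2 * sms_per_half))
echo "two halves: $((2 * sms_per_half)) SMs usable, ${lost_half} lost"

# Memory: 80 GB split into eight slices; seven identical small
# instances leave one memory slice (10 GB) unused.
mem_total=80
mem_used=$((7 * 10))
echo "memory: ${mem_used} GB usable of ${mem_total} GB"
```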

11:34.760 --> 11:41.560
There are certain combinations where compute misses out, and if you observe properly, the total comes to six compute slices.

11:41.560 --> 11:46.360
Whenever you use a combination totalling six, like 3+3,

11:47.080 --> 11:51.560
3+2+1, or 3+1+1+1, you are missing out on one compute slice,

11:51.560 --> 11:57.720
and to that extent you are simply wasting resources. So which combinations don't

11:57.720 --> 12:04.600
leave any slices unused? These are the combinations I have listed. You can also see

12:04.600 --> 12:10.200
in this table that six is missing; it does not work out for that total.
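
A small sanity check of the rule above: a combination only uses the whole GPU if its compute slices sum to 7, while totals of 6 leave one slice idle. (This sketch checks only the slice totals; the real MIG placement rules are stricter and also depend on allowed profile positions, so check the documentation's placement table.)

```shell
# Sum the compute slices in a "+"-separated combination string.
sum_slices() {
    total=0
    for n in $(echo "$1" | tr '+' ' '); do total=$((total + n)); done
    echo "$total"
}

for combo in "4+3" "3+2+1+1" "2+2+2+1" "1+1+1+1+1+1+1" "3+3" "3+2+1" "3+1+1+1"; do
    s=$(sum_slices "$combo")
    if [ "$s" -eq 7 ]; then
        echo "$combo uses all 7 compute slices"
    else
        echo "$combo leaves $((7 - s)) compute slice(s) idle"
    fi
done
```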

12:10.600 --> 12:21.080
Okay, this is the same slide, I will let it go. So how can we execute things on

12:21.080 --> 12:25.880
our MIG instance? We created a MIG instance and now we want to run our things on it. You can see,

12:25.880 --> 12:32.680
using the nvidia-smi -L command, the multiple available instances. You take those UUIDs

12:32.680 --> 12:37.480
after creating your MIG instance and simply export one as a variable. There is the

12:37.800 --> 12:43.000
CUDA_VISIBLE_DEVICES variable, where you assign your particular MIG device UUID and then run your process.

12:43.640 --> 12:47.960
The highlighted one is for the full GPU, not for the MIG device.
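
This step can be sketched as below. The UUID here is a made-up placeholder: substitute one printed by `nvidia-smi -L` on your machine (MIG device identifiers start with `MIG-`), and the workload script name is hypothetical.

```shell
# Point everything launched from this shell at a single MIG instance.
export CUDA_VISIBLE_DEVICES="MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
echo "processes started in this shell will see only: $CUDA_VISIBLE_DEVICES"
# e.g.: python generate_video.py   # hypothetical workload script
```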

12:51.080 --> 12:57.960
Okay, B200 also has similar profiles to the ones we saw earlier for H100.

12:59.720 --> 13:06.680
We show this because we will be presenting performance of the video generation models on B200

13:06.760 --> 13:11.560
and H200. So these are the partition methods which are available to us for B200.

13:13.480 --> 13:19.720
Okay, now for the good part: video generation. We have two models that we will be testing

13:19.720 --> 13:25.160
with. 14.1 is a very large video generation model; it uses a prompt, an image, and audio

13:25.160 --> 13:31.480
to generate a video. The other is a text-to-video generator, a smaller model

13:31.560 --> 13:36.760
of 5 billion parameters. Well, it is not small, but in the video generation context it is a smaller one.

13:38.360 --> 13:45.160
We have also observed the peak and steady VRAM utilization for these; they

13:45.160 --> 13:54.040
are mentioned over here. So what did we observe? The baseline compute is the H100 GPU.

13:54.040 --> 14:00.760
At the end there is an observation where we simply run MIG instances on B200 and show it alongside

14:01.640 --> 14:08.440
this. As you can see, certain workloads take some time and you are only able to run one of them.

14:09.880 --> 14:15.320
While in other cases, on B200, you are able to run multiple models in parallel.

14:16.120 --> 14:23.080
I will show a more detailed one on the next slide. So how many videos were we able to

14:23.080 --> 14:31.480
generate? Using the MPS method, isolation was not guaranteed, but it worked: around

14:31.480 --> 14:36.360
two instances in parallel, and three in a staggered manner. We run one process, then wait for one minute,

14:36.360 --> 14:43.640
then another, then the third one. It worked in that manner. For the MIG part

14:44.200 --> 14:51.400
we had guaranteed isolation on H100 and B200, so there was no need for that staggering. But

14:51.480 --> 14:58.520
in the case of the larger B200 GPU we might assume we will have a lot more slices available, but

14:58.520 --> 15:04.440
that's not the case. The slice combinations we have give quite a bit more than we require in terms

15:04.440 --> 15:11.160
of memory. In the case of the 5 billion parameter model on B200 MIG, we are wasting around 10

15:11.160 --> 15:17.400
GB of memory per model instance run in parallel. So there are certain combinations where things don't

15:17.480 --> 15:24.760
work out. And there is one more thing: you cannot let other work run on this unused GPU portion.

15:24.760 --> 15:30.200
With MPS you can. So that is a con, but isolation is completely guaranteed

15:30.200 --> 15:37.800
in the MIG case; you won't have any interference failures. Now, possible failures: I tried

15:37.800 --> 15:44.520
running a variable workload where a model briefly uses more memory and then drops back. It spikes for about

15:44.520 --> 15:50.760
5 seconds, which is a short time, but it still fails. So you can see that for a variable workload,

15:50.760 --> 15:56.440
if the variance is very high, even for a short time, this MIG partitioning is not a good idea.

15:56.440 --> 16:03.800
Also, for both single and multiple GPU instances, OOM occurs in a similar fashion.

16:05.640 --> 16:11.160
Okay, so now we have MIG set up and we understand what it is. How can we monitor the instances?

16:11.800 --> 16:17.800
There are two ways. First is nvidia-smi, if you want a quick CLI-based setup.

16:17.800 --> 16:22.920
You can monitor it using watch. But there is another one: the DCGM exporter.

16:22.920 --> 16:28.200
This exporter collects metrics just like any other exporter, such as node exporter, and you can combine

16:28.200 --> 16:35.800
it with Prometheus and Grafana. So this is the monitoring setup. What else can we do?
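
Before moving on, the two monitoring options can be sketched as follows. The commands are shown as comments because they need a GPU host; the container image tag is an assumption, so check NVIDIA's dcgm-exporter releases for a current one.

```shell
# Option 1: quick CLI loop — refresh the device and MIG status every 2 s.
#   watch -n 2 nvidia-smi
#
# Option 2: DCGM exporter scraped by Prometheus, visualised in Grafana.
#   docker run -d --gpus all --rm -p 9400:9400 \
#       nvcr.io/nvidia/k8s/dcgm-exporter:latest
#   curl localhost:9400/metrics    # Prometheus-format GPU metrics
echo "monitoring commands shown as comments; run them on the GPU host"
```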

16:35.800 --> 16:40.840
We can slice and create MIG instances, but we can also combine things together.

16:40.840 --> 16:46.600
You can have a MIG instance which runs the MPS daemon, because MPS is essentially a service:

16:46.600 --> 16:51.720
you simply enable it within a MIG instance and it will work flawlessly. But there is one more thing:

16:51.720 --> 16:58.680
you cannot enable MIG after enabling MPS. You need to disable MPS first, and then you can create MIG instances.
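
A minimal sketch of that combination — starting the MPS control daemon scoped to a single MIG instance (the UUID is a placeholder, the pipe/log paths are arbitrary choices, and the daemon commands are commented since they need an NVIDIA GPU host):

```shell
# Scope MPS to one MIG instance: MIG first, then MPS.
export CUDA_VISIBLE_DEVICES="MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # placeholder UUID
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
# nvidia-cuda-mps-control -d            # start the MPS control daemon
# ...run several processes sharing this one MIG slice...
# echo quit | nvidia-cuda-mps-control   # stop MPS before reconfiguring MIG
echo "MPS environment prepared under $CUDA_MPS_PIPE_DIRECTORY"
```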

16:58.680 --> 17:05.640
So that's one thing. Now the conclusions: for high-variance workloads, provision profiles with a high

17:05.640 --> 17:13.560
buffer. This will result in unused memory and other waste, but it will still

17:13.560 --> 17:19.640
provide isolation. For stable workloads MIG works well; for high

17:19.640 --> 17:26.280
variance use MPS whenever possible, and don't rely too much on MIG, because you will simply waste

17:26.280 --> 17:30.840
resources which could be used by other processes. And for the large video generation

17:30.920 --> 17:37.320
models, the H100 is not enough; you need at least a B200 to generate two to three videos

17:37.320 --> 17:42.120
in parallel. So that's all, thank you very much.

17:49.000 --> 17:50.200
Yes, you have a question.

18:00.840 --> 18:05.160
I think it's because of the architecture. Can you repeat the question again?

18:13.240 --> 18:20.120
So the question is: why do we have these particular partitions, and when can we expect

18:20.120 --> 18:27.160
resources to be utilized properly in the partitions? Well, I don't have the exact answer, but it is because

18:27.240 --> 18:32.840
of the architecture. I believe if NVIDIA improves the GPU architecture, we might get better,

18:32.840 --> 18:40.280
more dynamic profiles. Hopefully that happens. We have another question.

18:45.080 --> 18:52.280
Are the profiles the same for all the GPU types, like the H100 family? Do they all have

18:52.280 --> 19:07.000
exactly the same combinations available, or do different GPUs have different

19:07.000 --> 19:11.720
profiles available? Well, the answer is that certain combinations are similar within certain

19:11.720 --> 19:15.800
architectures, but different architectures have different combinations available.

19:16.360 --> 19:30.280
One more question. I haven't explored AMD or Intel; I have only worked with

19:30.360 --> 19:34.920
NVIDIA so far, hopefully I will. Okay, we have two more questions.

19:47.400 --> 19:49.400
Can you repeat that, a bit louder?

20:00.280 --> 20:10.200
So I tried generating videos on the different MIG partitions compared to

20:10.200 --> 20:17.960
MPS. The MIG ones had slight delays. The reason was that they were not using all the SMs in these

20:17.960 --> 20:23.320
partitions, because certain SMs were left over, and these video generation models are SM-intensive

20:23.320 --> 20:29.240
on top of the VRAM requirement. That caused around two to three minutes of delay in generating

20:29.560 --> 20:36.520
them. I generated around one minute forty seconds of video, and it was incurring those

20:36.520 --> 20:41.560
delays. Okay, we have another question.

20:42.520 --> 21:02.040
Okay, so the question was whether AMD has something similar. Yes, they do; a person from the audience

21:02.040 --> 21:06.040
explained the ways in which their GPUs can be partitioned. Thank you for that.

21:07.000 --> 21:08.520
Okay we have one more question.

21:18.200 --> 21:22.680
So the question is: can we have multiple GPUs where you provision workloads using

21:22.680 --> 21:30.600
MIG partitions? Yes, it can be done if you have 4x H100s available. In this case it is

21:30.600 --> 21:35.640
a single instance, but you can orchestrate your workload across different MIG partitions by enabling

21:35.640 --> 21:45.720
those partitions and tuning them. And there are many offerings out there, 4x H100

21:45.720 --> 21:51.320
and so on; for B200 also there are certain combinations. It depends on the provider.

21:51.320 --> 21:57.400
I used Worlda; they also provide combinations in smaller portions.

21:57.400 --> 21:59.800
I used one such combination.

22:05.640 --> 22:14.280
Thank you.

