WEBVTT

00:00.000 --> 00:13.360
Hi, everyone. My name is Josh. I work at Percona as a quality engineer, mainly testing databases.

00:13.360 --> 00:18.240
So recently I've been playing around with GPUs, and I was curious about what works,

00:18.240 --> 00:24.000
how we can run multiple models, and what observations I had. So this session

00:24.000 --> 00:28.720
will cover how you can run multiple models, what methods you have available and what

00:28.800 --> 00:35.040
are the good ways to do that. As an overview of this session: we will understand why we

00:35.040 --> 00:41.840
need to partition a GPU. We will explore GPU sharing methods, then we will get a use case of

00:41.840 --> 00:46.960
video generation. We will use a Wan model and test it across the multiple sharing methods,

00:47.520 --> 00:53.840
then we will see what crashes we encounter and then we will optimize the workload and we will

00:53.840 --> 01:02.560
compare it with the H100 and B200 GPUs. So why should we partition a GPU? I mean there are

01:02.560 --> 01:10.080
four key reasons which I identified. Number one, some models don't support batching. So you will have

01:10.080 --> 01:16.240
to run a separate instance of that model again. Now you cannot get a new GPU every time and allocate

01:16.240 --> 01:22.800
full resources to it. You need to partition it. So that's one reason. Second, your application

01:22.800 --> 01:28.480
requires separate compute capacity on the GPU. So you need to partition it because of

01:28.480 --> 01:35.200
that. Or you want to sell the compute, so you partition the GPU and you sell it, like some providers do.

01:35.200 --> 01:40.480
There are a couple of companies doing that. Or the last one, which is my favorite: you cannot afford

01:40.480 --> 01:45.680
another GPU. So partitioning is your only option. You have one GPU to play around with

01:45.680 --> 01:52.480
and you just partition that. So what methods do we have at our disposal? There are two

01:52.480 --> 01:58.880
key ways in which you can share a GPU. One is time based, the other is spatial. So

01:58.880 --> 02:05.520
temporal sharing, that is, time slicing, usually has one process allocating the entire GPU for a certain period

02:05.520 --> 02:11.040
of time. So there is a large context switch. Process one, process two, process three

02:11.040 --> 02:17.120
switch amongst each other as they work, and the GPU is allocated to each of them in turn.

02:17.680 --> 02:24.880
For the spatial one you have two methods. The first one is MPS, the other one is MIG. Now,

02:24.880 --> 02:31.680
of these two methods, MPS is not strict isolation. You have a GPU and it is being shared by multiple

02:31.680 --> 02:38.560
processes. MIG, however, has strict isolation. You partition the GPU into dedicated

02:38.560 --> 02:43.600
partitions and your processes work in isolation, so there is no interference between

02:43.600 --> 02:51.840
the processes. So this is a table comparing the sharing methods that are available to us.

02:51.840 --> 02:58.160
You can see time slicing has full GPU availability. MPS also has full GPU availability,

02:58.160 --> 03:04.640
but the processes are sharing the GPU. MIG, however, has each process allocated a certain

03:04.720 --> 03:10.560
portion of the GPU. So it is not complete GPU sharing; it is partial GPU access for each

03:10.560 --> 03:16.960
process. Time slicing has high context-switch overhead. Your process runs for its particular time,

03:17.520 --> 03:23.200
then it switches, so it has to put things back in memory when process two comes into the picture and uses

03:23.200 --> 03:29.360
the GPU. So there is a large context-switch overhead. MPS has some context switching because there are

03:29.440 --> 03:35.840
multiple processes running, so some context switching happens. With MIG, however, there is no context-

03:35.840 --> 03:40.960
switch overhead, because each process runs independently, completely isolated from other processes.

03:41.760 --> 03:46.960
So time slicing is like you are renting the GPU for a certain time. It is time sensitive, so if your

03:46.960 --> 03:54.240
workload is large you cannot run multiple workloads together. MPS is resource sensitive: you have

03:54.320 --> 03:59.760
a particular resource but it is being shared, and if one process is using too many resources, another

03:59.760 --> 04:06.240
process might fail. So MPS is resource sensitive. MIG has fixed resources, so there is no problem

04:06.240 --> 04:12.240
regarding that. With time slicing you can run a single workload at a time, but it runs at full capacity.

04:12.960 --> 04:20.160
With MPS, however, you can run up to 48 processes in parallel. MIG has fixed sizes, and with the smallest size

04:20.160 --> 04:25.680
you can get up to 7 instances. So there are at most 7 fixed, separate instances that you can run using the MIG approach.

04:26.880 --> 04:31.760
So where should we use time slicing? It is good for a workload which can

04:31.760 --> 04:38.080
wait: you do not need it urgently and it takes little time. Those workloads are best suited for

04:38.080 --> 04:45.360
time slicing. For MPS, if you know the nature of your workload you should focus on MPS because

04:45.440 --> 04:50.400
it uses the full GPU and you won't have to worry much about partitioning things and

04:50.400 --> 04:57.600
everything like that. MIG, however, works best for isolated workloads. If you have a workload

04:57.600 --> 05:02.960
that requires strict isolation, with no sharing of any kind, you should focus on MIG.

05:03.920 --> 05:09.680
Quality of service is guaranteed in two of them: time slicing, because it is full GPU

05:09.760 --> 05:16.160
allocation, and MIG. In MPS, however, there is no guarantee of quality of service. What do

05:16.160 --> 05:24.000
we mean by quality of service here? By this we mean memory bandwidth and SM usage. So that is the

05:24.000 --> 05:34.000
drawback of MPS. So which GPUs support which methods? MPS and MIG are

05:34.080 --> 05:40.960
supported on all enterprise GPUs, like Ampere, Blackwell and Hopper based ones; they support both the MPS method

05:40.960 --> 05:49.120
and MIG. Professional GPUs support MPS, and a few Ampere and Blackwell based ones like the A6000, Ada and so on

05:49.120 --> 05:56.320
also support MIG partitioning, provided you have the latest drivers. Then consumer GPUs support MPS

05:56.320 --> 06:01.360
but not MIG. So these are the things you should take into consideration before you try to

06:01.600 --> 06:08.000
start partitioning things, because consumer-grade GPUs are just not usable or manageable

06:08.000 --> 06:16.240
for MIG work. So let us see the first method. It is MPS, whose full form is Multi-Process

06:16.240 --> 06:22.880
Service. So essentially it is a service running on your server. What does this service do?

06:22.880 --> 06:28.960
It is implemented in CUDA, so you use it through the CUDA API. There are three things which

06:29.040 --> 06:36.960
make up MPS. The first one is the control daemon. This daemon is what manages

06:36.960 --> 06:42.080
the MPS server. So you have an MPS server running which is sharing GPU connections with the

06:42.080 --> 06:48.720
clients. A client here is any process; any process that uses CUDA is essentially a client

06:48.720 --> 06:54.960
once you have enabled the control daemon. Here you can see in the diagram that processes A,

06:54.960 --> 07:02.240
B and C all pass through the service controller, and then the Multi-Process Service actually

07:02.240 --> 07:07.600
assigns GPU portions to them. You can see the green portion is assigned to C, the purple one to

07:07.600 --> 07:16.960
B and the orange one to A. So this is the basic overview of MPS. How can we set up MPS?

07:16.960 --> 07:22.240
Well it is pretty straightforward. You first of all select the GPU device. So in case you have four

07:23.120 --> 07:29.600
GPUs, the numbering starts at zero for the first device, then you have one, two, three.

07:29.600 --> 07:35.920
So if you have four H100s they will be numbered accordingly. First of all we select the GPU

07:35.920 --> 07:42.240
that we want and then we start the MPS daemon. By default this comes installed on all H100

07:43.200 --> 07:47.760
and enterprise-level servers; you don't need to install anything. It is available, but the daemon

07:47.840 --> 07:53.040
is not started by default. So you start it yourself and then whenever you run any process that

07:53.040 --> 07:58.640
uses the CUDA driver it will simply act as an MPS client. You won't have to do anything else.
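
As a rough illustration, here is a minimal sketch of the setup just described, wrapped in Python: select the GPU via CUDA_VISIBLE_DEVICES, start the MPS control daemon, then launch the CUDA workload with the same environment. The GPU index 0 and the generate_video.py script are assumptions for illustration, not part of the talk.

```python
# Minimal MPS setup sketch (assumes GPU index 0; generate_video.py is a placeholder).
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")  # select the GPU to share

# Start the MPS control daemon (-d runs it in the background). It ships with the driver.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

# Any CUDA process launched with this environment now connects to MPS as a client.
subprocess.run(["python", "generate_video.py"], env=env)

# Shut MPS down afterwards by sending "quit" to the control daemon.
subprocess.run(["nvidia-cuda-mps-control"], input=b"quit\n", env=env)
```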

08:01.920 --> 08:08.400
Now let us see the MIG approach. What does MIG actually mean? With the MIG method, as you can

08:08.400 --> 08:15.440
see in figure C, there is a GPU partitioned into six instances. So there is this concept

08:15.440 --> 08:22.000
of GPU instance. Now these GPU instances are completely isolated from each other. It is a hardware

08:22.000 --> 08:27.440
based partitioning method where there is no software switching between the processes.

08:27.440 --> 08:36.480
This is how it looks. A MIG instance is made up of GPU slices, and you can

08:36.480 --> 08:43.520
see in figure D an example of an A100. You can

08:43.600 --> 08:51.040
see the number 5 GB; there are eight such partitions of the memory, and seven compute slices.

08:51.760 --> 08:59.920
GPU slices plus GPU engines equal a GPU instance, as shown on the left-hand side. A GPU slice is basically

08:59.920 --> 09:07.120
a memory slice combined with a compute slice, and one memory slice is roughly one

09:07.200 --> 09:12.320
eighth of the total GPU. Similarly a compute slice is around one seventh of the GPU.

09:13.040 --> 09:23.280
Here, for the sake of simplicity, we are writing SMs as compute. So how does slicing work? Can

09:23.280 --> 09:29.920
I slice compute first and then slice memory? No. First of all you slice memory. You assign memory

09:29.920 --> 09:36.320
and then you assign the compute to that particular sliced memory. So there is a hierarchy and an order of steps.

09:36.880 --> 09:44.880
You follow those steps to create a MIG slice. So what combinations do we have?

09:44.880 --> 09:51.040
On the left-hand side is the smallest instance combination, where a small memory slice is selected

09:51.040 --> 09:58.320
and the smallest isolated compute instance is selected. So it is a 1g.5gb instance: 1g

09:58.320 --> 10:06.000
means one compute slice and 5gb means 5 GB of VRAM. Now this workload is completely isolated from the

10:06.080 --> 10:10.560
point of view of memory because no other block of memory is added into this combination.

10:10.560 --> 10:15.920
It is the smallest combination available, and because it is a small instance, size might be an

10:15.920 --> 10:22.160
issue for your workload. Then you have multiple isolated compute instances. What does this mean?

10:22.800 --> 10:30.960
You have a memory block on top which is a combination of four 5 GB slices, and a

10:31.040 --> 10:38.320
four-compute GPU instance. This entire structure is partitioned at a second level, where one compute slice

10:38.320 --> 10:45.360
is enabled within the four-compute setup. So 1c.4g actually means that you have a 4g

10:45.360 --> 10:53.760
GPU instance, but that 4g is partitioned and one compute slice is used from it. So that 4g is again

10:53.760 --> 11:00.880
partitioned into four different parts. Here you can see that memory is still shared between

11:01.040 --> 11:07.600
those four compute instances. Essentially this might create problems: you have four isolated compute

11:07.600 --> 11:13.840
instances but they are sharing the same memory. So your workload might overwhelm the memory

11:13.840 --> 11:19.920
which is 20 GB, and you might get some issues. There is another approach where you can have multiple

11:19.920 --> 11:25.840
large chunks of compute as well as memory. This essentially is a single instance with a large

11:26.320 --> 11:33.120
amount of GPU compute and memory. There is one drawback: if your workload is not

11:33.120 --> 11:38.400
that intensive, your compute or your memory might be wasted. It sits idle,

11:38.400 --> 11:44.880
so you might not be using it. So what are the overheads in MIG? You are using the

11:44.880 --> 11:49.920
MIG approach and you are getting guaranteed quality of service, but there is one problem with this approach.

11:50.000 --> 11:55.440
You are compromising on certain things: there will be some memory left over,

11:56.320 --> 12:01.920
some SMs not being utilized, and there are certain combinations in which you are essentially

12:01.920 --> 12:08.080
wasting compute capacity. This diagram displays the H100 MIG profiles.

12:08.880 --> 12:18.560
As you can see, 1g has around 16 SMs, which is not exactly one seventh of 132. So you can see

12:18.640 --> 12:25.200
there are certain SMs which are not included. For 3g, too, it is not exactly 3 times 16;

12:25.200 --> 12:30.480
it is more than that. So there are certain drawbacks of using MIG.
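
As a quick back-of-the-envelope check, using only the two numbers from the slide (132 SMs on the H100 and 16 SMs per 1g profile), the leftover looks roughly like this:

```python
# Rough arithmetic for the MIG SM overhead, using the numbers quoted above.
total_sms = 132       # H100 SXM streaming multiprocessors
sms_per_1g = 16       # SMs exposed by one 1g profile, per the slide

print(total_sms / 7)                # ~18.86 SMs would be an exact one-seventh
print(7 * sms_per_1g)               # 112 SMs actually usable across seven 1g slices
print(total_sms - 7 * sms_per_1g)   # 20 SMs left unused in that layout
```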

12:32.560 --> 12:39.680
So how do we create a MIG partition? As you can see from figure K, you are listing

12:39.680 --> 12:45.840
out the GPUs that you have. We have a device named GPU 0, which is an H100 80GB instance.

12:46.800 --> 12:53.520
Then you see what profiles are available to you, and which profiles are free.

12:53.520 --> 12:58.240
And then you partition them. So let us see how we can create a MIG instance.

12:58.240 --> 13:03.280
First of all we need to enable MIG mode. This mode is not enabled by default,

13:03.280 --> 13:07.280
so you enable MIG mode in the first step.

13:07.280 --> 13:11.520
Then in the second step you create a MIG instance on GPU 0.

13:11.680 --> 13:19.200
-i 0 basically says to use GPU 0 and create a GPU instance; cgi means create GPU

13:19.200 --> 13:25.520
instance. And the number 9 that you are seeing is the profile ID. You can see the profile ID in

13:25.520 --> 13:32.320
figure L; it is the 3g.40gb profile, and you are creating two such instances. So 9,9

13:33.120 --> 13:43.520
basically gives you two 3g.40gb partitions. Then you can list individually

13:43.520 --> 13:49.440
which GPU instances you have and which compute instances you have. You can see from step 3

13:49.440 --> 13:55.040
and step 2 that there is an order: you first create a GPU instance and then you assign a compute

13:55.040 --> 14:01.120
instance to it. So what are the other things that you can do with the make partitioning?

14:01.120 --> 14:07.200
Well, if you partition and create MIG instances, you can enable the MPS daemon within one. MPS is basically

14:07.200 --> 14:14.240
a service which can be managed per instance. So you can create a combination like this: you can have a variable

14:14.240 --> 14:19.920
workload like P1, P2, P3 running on MPS within one MIG instance, while you have a completely

14:19.920 --> 14:26.800
separate, isolated workload on MIG instance 2. There is one more thing that you should keep in mind:

14:26.800 --> 14:33.280
if you have the MPS service

14:33.280 --> 14:38.720
enabled first, you cannot create MIG. There is an order to it: you disable the MPS daemon first and then you can create the MIG partitions.
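
A minimal sketch of that combination, assuming the MIG instances already exist: point an MPS daemon at one MIG device via its UUID and give it its own pipe and log directories. The UUID and directory paths here are placeholders, not values from the talk.

```python
# Sketch: run an MPS daemon inside one MIG instance (create MIG first, then MPS).
import os
import subprocess

mig_uuid = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder; list real UUIDs with `nvidia-smi -L`

env = dict(
    os.environ,
    CUDA_VISIBLE_DEVICES=mig_uuid,                 # scope MPS to this one MIG instance
    CUDA_MPS_PIPE_DIRECTORY="/tmp/mps-pipe-gi0",   # per-instance pipe directory
    CUDA_MPS_LOG_DIRECTORY="/tmp/mps-log-gi0",     # per-instance log directory
)
os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)

subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)
# P1, P2, P3 launched with this same environment share that one MIG slice,
# while workloads on the other MIG instance stay fully isolated.
```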

14:38.720 --> 14:46.960
So let us look at the example that we tested. The Wan 2.2 model is what we

14:46.960 --> 14:53.360
used. There are two models: one is a large 14 billion parameter one, and then there is another one which

14:53.520 --> 14:59.840
is a text-input-based 5 billion parameter model. On the left-hand side, the larger one uses a prompt,

14:59.840 --> 15:07.520
image and audio to generate a video. So I studied how much memory is needed and

15:07.520 --> 15:13.760
what the steady workload is. So we know the nature of this workload: it stays constant

15:13.760 --> 15:20.160
at around 55 GB of VRAM for a certain duration of time, but at the end, when the files are being created, it

15:20.160 --> 15:27.040
shoots up to 58 GB. So peak means that particular 30-second duration. Then there is the medium

15:27.040 --> 15:36.720
parameter model, which is the text-input one. It generates 720p videos. So we encountered an

15:36.720 --> 15:44.640
out-of-memory error. We were testing these things on MPS, and you can see in the diagram that B is where

15:44.640 --> 15:50.000
you can monitor the CUDA processes. What I did was open a terminal, start nvidia-smi,

15:50.400 --> 15:57.200
and keep it under watch. I ran two processes in the bottom shells just to see what happens.

15:58.640 --> 16:05.600
Two of them were running; both of them are the very large model, which needs 55 GB. Now when both of them

16:05.600 --> 16:12.720
started at the same time, they needed more resources. One process was able to outcompete the other one

16:12.720 --> 16:17.040
and the other one failed. You can see in C that we got the out-of-memory error.
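
A rough sketch of that monitoring setup: launch the two jobs under MPS and poll GPU memory until both finish. The generate_video.py script stands in for the actual workload and is only illustrative.

```python
# Sketch: launch two jobs and poll GPU memory while they run.
import subprocess
import time

jobs = [subprocess.Popen(["python", "generate_video.py"]) for _ in range(2)]  # placeholder workload

while any(p.poll() is None for p in jobs):
    usage = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(usage)      # e.g. "55123 MiB, 81559 MiB"
    time.sleep(5)

print([p.returncode for p in jobs])   # a nonzero code marks the process that hit the OOM
```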

16:17.920 --> 16:25.920
Okay, so what else did we test? We tested running three 5 billion parameter models using MPS,

16:26.800 --> 16:32.480
where the workload was constant. We were seeing around 22 GB of memory equally distributed across the

16:32.480 --> 16:40.080
processes, but just before the end it crashed, because one of them required 33 GB and another

16:40.080 --> 16:46.400
workload got cancelled. The good thing about this is that it tells you which process is causing the issue:

16:46.400 --> 16:53.680
because of processes 221 and 374 things crashed, and those processes exited.

16:55.520 --> 17:02.720
So what results did we find? Well, you can see with this graph that we were able to run the full

17:02.720 --> 17:09.760
GPU baseline tests. Time to video generation was good, but it failed in a couple of cases.

17:10.160 --> 17:17.200
So this is the overview of the performance tests. The last point is that if you want to run

17:17.200 --> 17:23.120
a big model you cannot do it on an H100; you need a B200. So we also tested that on a B200.

17:25.280 --> 17:32.720
But there is one more method. Those three processes had roughly equal usage of around 22 GB.

17:32.720 --> 17:38.160
We can run them in a certain order to make sure that all three of them run properly.

17:38.400 --> 17:45.440
So this is where we use that. We know that the final 30 seconds is where the problem occurs.

17:45.440 --> 17:50.320
So why not simply start each process one minute after the other, staggering things?

17:50.320 --> 17:55.600
You start certain things first, then wait, and process other things in parallel. We were

17:55.600 --> 18:00.800
able to do that successfully, and it took 10 minutes and 30 seconds to generate three videos, which is

18:00.800 --> 18:07.760
quite quick compared to the other runs we had. So changing your workload scheduling strategy can improve your performance and throughput.
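
A minimal sketch of that staggering idea: start each generation job one minute after the previous one so the end-of-run memory peaks do not overlap. The script name and inputs are placeholders for whatever your workload actually is.

```python
# Sketch: staggered launch of three video-generation jobs, one minute apart.
import subprocess
import time

inputs = ["prompt_a.json", "prompt_b.json", "prompt_c.json"]  # placeholder inputs
procs = []

for item in inputs:
    procs.append(subprocess.Popen(["python", "generate_video.py", item]))
    time.sleep(60)  # give the previous job a one-minute head start

for p in procs:
    p.wait()  # in our run, all three videos finished in about 10m30s
```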

18:07.840 --> 18:15.760
Now let us look at the B200. The B200 has good profiles,

18:16.640 --> 18:22.320
but you can generate a maximum of three MIG profiles, because even though it is 180 GB it is not equally

18:22.320 --> 18:28.400
partitioned. There are certain combinations which are not useful for our workload, and the same

18:28.400 --> 18:34.720
goes for the 14 billion model: you can only have two. So we tested this and we were able to generate

18:34.800 --> 18:42.960
three large videos, and it completed in a really good time. So this is the comparison of both methods

18:42.960 --> 18:51.120
on the H100 and B200. As you can see, on the H100 using MPS we were able to run two without any issues,

18:51.120 --> 18:58.160
and three using the staggered approach. We were only able to run one large model, but when we used the B200 we were

18:58.160 --> 19:04.000
able to run three of them in parallel without any issues. For the smaller model we were

19:04.000 --> 19:08.720
able to run five in parallel without any issues. So that is the good thing about using a bigger GPU.

19:09.920 --> 19:15.920
Then for the MIG partitions there was not much improvement. You can only have three

19:16.720 --> 19:21.440
based on the partitioning approach, which will be faster, but using four it will be a bit slower.

19:24.400 --> 19:30.560
So what is the conclusion? There is no single method which is great for everything. MIG partitioning seems

19:30.640 --> 19:38.080
really good, but it is not good in certain cases. For large models it is recommended to use a B200.

19:38.080 --> 19:43.120
You can try to fit the workload, but you will need to optimize your model much more, which will compromise

19:43.120 --> 19:49.280
the quality of the output. And that is the last point: optimize your model as much as you can.

19:51.680 --> 19:56.720
Also, I will be giving two more talks. One will be MIG-specific and another will be on monitoring.

19:56.720 --> 20:00.960
The monitoring one will be a bit more interesting: how you can monitor the GPUs and what the key

20:00.960 --> 20:03.200
things are. So yeah, see you.

20:09.840 --> 20:14.880
Thank you very much. Unfortunately there is no time for questions, but you can grab the speaker afterwards.

