WEBVTT

00:00.000 --> 00:11.960
Beyond nvidia-smi. So, this will be a session for a beginner who is exploring GPUs

00:11.960 --> 00:16.880
and the pitfalls that you encounter. So, I work at Perkorn as a quality engineer,

00:16.880 --> 00:22.480
and recently, due to the rise of GPUs and AI, I was curious about GPUs, like how can I

00:22.480 --> 00:28.320
measure how much GPU performance I am actually getting. So, that was the core intention of, you know,

00:28.320 --> 00:34.320
doing this session. So, we will see the fundamentals of the GPU, common

00:34.320 --> 00:41.200
monitoring mistakes, then we will look at an example of how you can identify the leftover potential

00:41.200 --> 00:47.360
of your GPU, then we will understand the key metrics which are needed, then we will go into

00:47.360 --> 00:53.280
workload-specific metrics, like which workload-specific metrics you need to make sure are working,

00:53.280 --> 01:00.000
and then we will see some utilities which are good. So, yeah. So, GPUs are primarily

01:00.000 --> 01:07.520
comprised of compute, memory, and decoders. So, unlike CPUs, the compute portion of GPUs is very large,

01:07.520 --> 01:11.440
if you want to simplify things, and there are certain decoders specific to certain functions.

01:12.320 --> 01:18.320
Compute is the main thing here, and streaming multiprocessors (SMs) are a loose way of saying that

01:18.400 --> 01:24.000
your computations are, you know, running. So, an SM is the portion which is executing the computations

01:24.000 --> 01:30.480
based on the data types that you have. So, yeah, there are three common mistakes that I found:

01:30.480 --> 01:36.320
relying too much on nvidia-smi. It is a basic utility that you run and you see, okay, I am getting

01:36.320 --> 01:41.840
this percent GPU utilization, I am getting this much DRAM utilization. I mean, it is a good thing,

01:41.840 --> 01:47.920
you can use it for certain workloads, but it is not showing you the whole picture. So, to identify that,

01:48.000 --> 01:52.800
you need to monitor other metrics. So, there are three metrics which I have been able to identify:

01:52.800 --> 02:00.160
the tensor core metrics, then SM utilization, and DRAM activity. So, yeah, the combination of these tells you

02:00.160 --> 02:07.360
the full picture, like what your workload is actually doing. So, let us look at one example: we are

02:07.360 --> 02:14.080
watching nvidia-smi and we are seeing one hundred percent all the time. So, obviously we think, like with

02:14.160 --> 02:20.000
CPUs, that it might be using a hundred percent of my GPU, but that is actually wrong. If you look at only

02:20.000 --> 02:25.600
this metric and try to, you know, buy new GPUs, you will lose a lot of money. So, it is not a reliable

02:25.600 --> 02:32.720
metric as such. So, one such example is this: I ran an FP32 workload on the left-hand side

02:32.720 --> 02:38.240
and FP16 on the right-hand side. You can see that nvidia-smi is showing me a hundred percent in both

02:38.320 --> 02:44.400
cases, but is it actually true? I mean, why is it showing the same thing for two different

02:44.400 --> 02:50.320
workloads? Does that mean that I cannot run much more workload on it? So, that is what we will

02:50.320 --> 02:58.080
see. So, yeah, this is the case of idle tensor cores, where I am using the same setup that was

02:58.080 --> 03:03.680
previously done, but we are showing what is actually going on in detail. So, there is this utility

03:03.840 --> 03:09.840
called DCGMI. So, it is a data center GPU monitoring tool by NVIDIA. A certain portion of it is open source;

03:09.840 --> 03:13.760
there are certain methods which they are not releasing to the public, so that part is proprietary,

03:13.760 --> 03:19.280
but the exporters and all those things, you can basically contribute to the project. So,

03:19.280 --> 03:24.560
it is an open-core utility. And you can see that in the first case there is zero tensor core

03:24.560 --> 03:30.800
utilization, but in the other it is using them completely. Now, why is this happening? So, by default on

03:30.880 --> 03:37.920
H100s, and even on certain other GPUs, FP32 is not tensor core supported. So, it is using CUDA cores

03:37.920 --> 03:43.200
instead of the tensor cores. So, this behavior is due to that, and in the second one, when we ran

03:43.200 --> 03:48.880
the FP16 workload, it is using the tensor cores, which means you are actually getting

03:48.880 --> 03:55.120
more out of your GPU compared to the previous one. So, relying only on 100 percent utilization

03:55.200 --> 04:03.440
is a bad idea if you are trying to, you know, assess this. So, from what we saw, we can conclude that GPU

04:03.440 --> 04:08.480
utilization is not the same as GPU efficiency, because in this picture you can see a tensor core

04:08.480 --> 04:14.240
example. The tensor core has multiple supported data types that you can execute, apart from the CUDA cores.

04:14.240 --> 04:20.080
So, you need to understand the nature of your workload that is being executed, what exact

04:20.160 --> 04:26.240
data types it is using, and whether it can use the data types which the tensor core supports. So, yeah,

04:26.240 --> 04:30.720
but what are tensor cores, actually? So, they are basically a matrix multiplication

04:32.080 --> 04:37.120
hardware implementation by NVIDIA. So, at the hardware level they are doing matrix multiplication,

04:37.120 --> 04:43.120
and that is what drives the entire AI workload. So, that is one thing that you should take into consideration.

04:43.200 --> 04:53.040
So, 100 percent GPU utilization in nvidia-smi, what exactly does that mean? Does it measure how many SMs are doing work? Well,

04:53.040 --> 04:57.440
it simply measures if SMs are doing work, not exactly how many are doing work. It does not give you

04:57.440 --> 05:03.520
much activity detail, like SM activity or tensor core usage. So, it is not a good idea if you want

05:03.520 --> 05:10.480
more detail. So, how exactly is this utility working? So, you install the data center GPU manager package,

05:11.360 --> 05:16.560
then you start a service, which is the NVIDIA DCGM service. It runs in the background, it collects the

05:16.560 --> 05:23.840
metrics, and yeah, you can monitor your metrics using the metric codes. So, DCGM has a certain set of

05:23.840 --> 05:29.600
field codes that you can use to monitor this, and it samples things at a certain interval of time.

05:29.600 --> 05:33.600
So, it is a snapshot in time, and it averages out certain metrics.
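
As a rough sketch of how those periodic DCGM snapshots could be pulled programmatically: the example below shells out to `dcgmi dmon` and parses its rows. The field IDs here (1002 SM activity, 1003 SM occupancy, 1004 tensor pipe activity, 1005 DRAM activity) and the column layout are assumptions based on common DCGM profiling fields; verify them against your DCGM version.

```python
import subprocess

# Assumed DCGM profiling field IDs (check dcgm_fields.h for your version):
#   1002 = SM activity, 1003 = SM occupancy,
#   1004 = tensor pipe activity, 1005 = DRAM activity
FIELDS = "1002,1003,1004,1005"

def parse_dmon_line(line):
    """Parse one data row of `dcgmi dmon` output into a dict of floats.

    Rows are assumed to look like: 'GPU 0   0.856  0.664  0.433  0.219'.
    Header and comment rows return None.
    """
    parts = line.split()
    if len(parts) < 6 or parts[0] != "GPU":
        return None
    gpu_id = int(parts[1])
    sm_act, sm_occ, tensor, dram = (float(p) for p in parts[2:6])
    return {"gpu": gpu_id, "sm_active": sm_act, "sm_occupancy": sm_occ,
            "tensor_active": tensor, "dram_active": dram}

def sample_metrics(count=5):
    """Take `count` snapshots via dcgmi and return the parsed rows."""
    out = subprocess.run(
        ["dcgmi", "dmon", "-e", FIELDS, "-c", str(count)],
        capture_output=True, text=True, check=True).stdout
    return [r for r in (parse_dmon_line(l) for l in out.splitlines()) if r]
```

Remember that, as the talk notes, each value is already an average over the sampling window, not an instantaneous reading.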

05:33.920 --> 05:41.520
Okay. So, what are the metrics that we should measure? So, first of all, if you are running an

05:41.520 --> 05:47.440
AI inference workload, you should focus more on the tensor core utilization, because it is the

05:47.440 --> 05:50.880
performance driver, and if you are not using it, you are leaving too much on the table.

05:51.520 --> 05:57.040
Then there is memory bandwidth, there is SM occupancy, SM activity, and estimation of the

05:57.040 --> 06:02.000
TFLOPS. So, against the performance benchmark that NVIDIA gives you, how much are you actually using?

06:02.080 --> 06:07.440
So, you can get some estimation of the TFLOPS rather than writing instrumentation code inside your

06:07.440 --> 06:13.280
models and all those things. So, this is a diagram of tensor cores. So, you can see

06:13.280 --> 06:18.720
there are multiple matrix multiplications as well as additions happening at the same time. So,

06:18.720 --> 06:23.920
this is something that they implemented in their hardware and if you are using that you are

06:23.920 --> 06:28.240
speeding up your inference. So, this is the thing that you should take into consideration.

06:28.800 --> 06:36.080
So, one SM consists of multiple tensor cores. So, this is just a portion of an SM; there are many

06:36.080 --> 06:43.920
more in a GPU. So, yeah, as I mentioned earlier, the potential performance

06:43.920 --> 06:48.960
improvement. So, they have performance benchmarks, and you can see for a particular data type

06:48.960 --> 06:54.480
workload how many TFLOPS are expected, how much you can, you know, get out of it. So, if you are

06:55.040 --> 07:01.040
using a simple CUDA core setup with FP16, you are getting only 120 TFLOPS, but if you run it

07:01.040 --> 07:07.120
using tensor cores, you are getting around 1000 TFLOPS. It is like 8 to 10 times faster. So,

07:07.120 --> 07:14.000
that is the thing: enable tensor cores whenever possible. So, how can we monitor it? So, there are

07:14.000 --> 07:19.760
certain codes that we discussed earlier. So, these are the metrics which are collected by DCGM,

07:20.160 --> 07:26.080
and they have certain codes; what they actually check is written on the right-hand side. So, yeah,

07:26.080 --> 07:30.880
matrix multiplication is the key thing in the case of tensor cores. So, you have simple matrix

07:30.880 --> 07:38.960
multiplication, integer, FP16 which is half precision, and double precision, that is FP64. So, these are

07:38.960 --> 07:44.960
the metrics that you should focus on and look at. So, this is an example of an FP16 workload.
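
To make the code-to-pipe mapping concrete, here is a small sketch pairing data types with the DCGM per-pipe field IDs. The IDs (1004 generic tensor, 1006 FP64, 1007 FP32, 1008 FP16) are assumptions drawn from common DCGM profiling fields; double-check them against your installation before relying on them.

```python
# Assumed DCGM profiling field IDs (verify against your dcgm_fields.h):
PIPE_FIELDS = {
    "tensor": 1004,   # DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (any tensor pipe)
    "fp64":   1006,   # DCGM_FI_PROF_PIPE_FP64_ACTIVE   (double precision)
    "fp32":   1007,   # DCGM_FI_PROF_PIPE_FP32_ACTIVE   (single precision)
    "fp16":   1008,   # DCGM_FI_PROF_PIPE_FP16_ACTIVE   (half precision)
}

def fields_for(dtype):
    """Field IDs worth watching for a workload of this dtype: the generic
    tensor pipe plus the dtype-specific pipe, when one is known."""
    ids = [PIPE_FIELDS["tensor"]]
    if dtype in PIPE_FIELDS and dtype != "tensor":
        ids.append(PIPE_FIELDS[dtype])
    return ids

# e.g. build a dcgmi invocation for an FP16 run:
cmd = "dcgmi dmon -e " + ",".join(map(str, fields_for("fp16")))
```

Watching the generic tensor metric alongside the dtype-specific one is what lets you tell, as in the talk's examples, which pipe your workload actually lands on.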

07:44.960 --> 07:49.520
You can see that tensor cores are being used, but we do not know exactly what the tensor cores are

07:49.600 --> 07:57.600
doing. So, we get details using certain codes. So, yeah, this is an FP16 example. So, you can see

07:57.600 --> 08:03.440
FP16 activity and tensor usage are the same, because they are using the same particular thing. So, the tensor metric is

08:03.440 --> 08:08.560
basically a generic overview of your tensor core usage, but the other three metrics are

08:08.560 --> 08:16.160
a detailed view of those particular things. So, yeah, this is another example, an FP64 workload. So,

08:16.240 --> 08:22.080
you can see double precision is what is being shown, and similarly for an INT8 workload.

08:23.360 --> 08:29.520
So, what actually are the SM metrics that we are looking at? So, everything executes as a thread.

08:29.520 --> 08:37.440
So, a group of threads is a combination called a warp, and warps basically run on the streaming

08:37.440 --> 08:43.840
multiprocessors. So, this is something that you should see. So, SM occupancy is a metric that you should

08:43.920 --> 08:51.600
be aware of. So, it is like the SMs are occupied: things are scheduled and about to run, but what is actually

08:51.600 --> 08:56.800
running is seen using the SM activity metric. So, these are the two things. Now, one thing that you

08:56.800 --> 09:03.680
should keep in mind is that these metrics are averaged out. So, you do not get an exact value per SM.

09:03.680 --> 09:10.720
It is averaged over all the SMs that are available. So, this is an example of high occupancy:

09:11.440 --> 09:19.040
you can see that SM activity is around 0.97, but SM occupancy is like 0.66. So, this is actually

09:19.040 --> 09:26.320
doing some work, if you try to infer from this. A low occupancy means that memory is being used,

09:26.320 --> 09:30.800
but the SMs that you are seeing, right, they are not doing much work. They are sitting idle, waiting

09:30.800 --> 09:38.000
for things. So, that is one inference that you can draw from this. So, how can we measure

09:38.080 --> 09:44.320
TFLOPS? So, as we saw, certain things are being executed using SMs. So, we can multiply

09:44.320 --> 09:50.240
those percentages. You can get an idea of what exactly is happening using the peak TFLOPS which

09:50.240 --> 09:55.440
are given in the benchmark. So, you take the peak TFLOPS, you multiply it by the SM activity

09:55.440 --> 10:00.080
as well as the SM occupancy, and then you multiply it by the tensor activity that you have. So,

10:00.080 --> 10:05.120
it will give you a rough idea, an estimated idea; you do not get an exact value out of it. But,

10:05.120 --> 10:10.800
you can at least identify that this workload is achieving this portion of the given benchmark.
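
The multiplication described above can be sketched as a tiny helper. The peak figure and activity values below are made-up illustrations, not real benchmark numbers:

```python
def estimate_tflops(peak_tflops, sm_activity, sm_occupancy, tensor_activity):
    """Rough achieved-throughput estimate, as described in the talk:
    peak TFLOPS scaled by the averaged SM activity, SM occupancy, and
    tensor pipe activity (all fractions in [0, 1]). This is only an
    estimate for stable workloads, not an exact measurement."""
    for frac in (sm_activity, sm_occupancy, tensor_activity):
        if not 0.0 <= frac <= 1.0:
            raise ValueError("activity fractions must be in [0, 1]")
    return peak_tflops * sm_activity * sm_occupancy * tensor_activity

# With a hypothetical 1000-peak-TFLOPS card, 0.97 SM activity,
# 0.66 occupancy, and 0.5 tensor activity:
print(round(estimate_tflops(1000, 0.97, 0.66, 0.5), 1))  # → 320.1
```

Because every factor is a window average, the product inherits all the averaging error, which is why the talk cautions against using it on bursty workloads.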

10:12.000 --> 10:16.480
Where this might not work is for bursty workloads; like, if you have high variance in the

10:16.480 --> 10:21.040
workload, this is not a good idea to do. But for stable workloads, this method will probably

10:21.040 --> 10:28.000
give you a good estimation. So, yeah, things we should keep in mind: by themselves,

10:28.000 --> 10:34.560
GPUs are useless; you need CPUs, and CPUs are transferring things to the GPUs. Now, the thing is

10:34.640 --> 10:39.920
that this transfer is the bottleneck nowadays. You can get more compute, but the transfer of the

10:39.920 --> 10:46.000
data from the CPU memory to the GPU memory is where things are breaking. So, one such observation

10:46.000 --> 10:52.800
you might see is that you get high SM occupancy, low SM activity, and high VRAM usage. What this means is

10:52.800 --> 11:00.080
that you have so much stuff queued up, but not much is actually getting done. So, SMs are occupied,

11:00.240 --> 11:05.280
but they are doing very little work. That is what you can identify from this. And this is okay:

11:05.280 --> 11:09.920
if you get this kind of pattern, it is often an inference workload.

11:10.640 --> 11:15.200
Inference workloads do not use your GPUs to the fullest in terms of compute.
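
The patterns just described can be condensed into a crude triage helper. The thresholds here are illustrative guesses for the sake of the sketch, not NVIDIA guidance; tune them to your own fleet:

```python
def diagnose(sm_occupancy, sm_activity, dram_activity, hi=0.6, lo=0.3):
    """Very rough heuristic based on the patterns discussed in the talk.
    All inputs are fractions in [0, 1]; `hi`/`lo` are illustrative cutoffs."""
    if sm_occupancy >= hi and sm_activity <= lo and dram_activity >= hi:
        # Lots of warps resident and memory busy, but little compute
        # retiring: likely stalled on memory (common for inference).
        return "memory-bound"
    if sm_activity >= hi and sm_occupancy >= hi:
        return "compute-busy"
    if sm_activity <= lo and dram_activity <= lo:
        return "idle / host-bound"
    return "mixed"
```

For example, high occupancy with low activity and busy DRAM maps to the memory-bound case the talk calls out for inference workloads.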

11:18.080 --> 11:24.320
Yeah, also the DCGMI utility that we saw has one more cool feature: you can set certain policies.

11:25.280 --> 11:30.400
So, what do we mean by a policy? Like, with nvidia-smi you can get a brief idea about things.

11:31.200 --> 11:35.840
You can see the temperature as well, but you cannot set alerts on things using the CLI. This utility has

11:35.840 --> 11:44.240
that setup built in. So, we set a policy for a 60 degree Celsius temperature, and we are registering

11:44.240 --> 11:52.880
that policy. So, when we run a workload, it will listen for violations. You can see at the top that the

11:52.960 --> 12:04.800
temperature is rising, and as soon as it hits 60 it will give you a notification on the terminal.

12:04.800 --> 12:08.960
So, rather than watching nvidia-smi, you can utilize it like this.
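
DCGM's policy engine handles this natively via `dcgmi policy`; as a rough stand-in for machines without DCGM, the same idea can be sketched with nvidia-smi's query interface. The 60-degree limit comes from the talk; the polling loop and helper names are this sketch's own:

```python
import subprocess
import time

LIMIT_C = 60  # policy threshold used in the talk

def gpu_temps():
    """Read per-GPU temperature (degrees C) via nvidia-smi's query mode."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return [int(t) for t in out.split()]

def violations(temps, limit=LIMIT_C):
    """Return indices of GPUs at or above the limit."""
    return [i for i, t in enumerate(temps) if t >= limit]

def watch(interval=5):
    # Poll every `interval` seconds, mimicking DCGM's policy notification.
    while True:
        for i in violations(gpu_temps()):
            print(f"policy violation: GPU {i} reached {LIMIT_C} C or more")
        time.sleep(interval)
```

The real DCGM policy mechanism is preferable in production, since it is event-driven inside the host engine rather than a userspace polling loop.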

12:08.960 --> 12:23.280
And by default it monitors every five seconds. So, yeah, the libraries which are being used are

12:23.280 --> 12:27.920
set to this. You cannot change those particular modules because they are controlled by NVIDIA.

12:30.720 --> 12:34.880
You can enable those modules, though. So, that is something that you can do.

12:35.360 --> 12:40.800
Yeah, so this is a brief overview of the graph that we saw earlier: what the metrics are,

12:40.800 --> 12:48.480
what the DCGMI codes are, and what the purpose of each is. Now, comparing with the tools

12:48.480 --> 12:54.720
that are available out there: So, DCGMI is a good tool; it is an open-core tool, it is by NVIDIA,

12:54.720 --> 13:00.880
but it is, you can say, community driven as well. The exporters and all those things are

13:00.960 --> 13:05.760
something that you can contribute to. nvidia-smi is a good thing, but it does not give you much

13:05.760 --> 13:11.280
detail; it is a very basic tool. So, relying on it can lead you to bad inferences.

13:12.320 --> 13:18.400
There are other tools which are proprietary. So, for example, Nsight Compute gives you SM-level

13:18.400 --> 13:22.480
activity that you can see. So, it is specific to the kernel activity that you want to monitor.

13:22.480 --> 13:27.760
So, it gives you more detail. Nsight Systems is a system-level view of your GPU, but these two are

13:27.760 --> 13:36.960
proprietary. So, yeah, monitoring solutions: you can also set up an exporter for DCGM.

13:36.960 --> 13:41.360
So, they have the exporter available out there, and you can combine it with Prometheus and Grafana. So,

13:41.360 --> 13:48.560
you can create a dashboard out of it. So, the major workload, as we discussed, is FP32, the floating-point
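
A minimal Prometheus scrape config for that setup might look like the fragment below. It assumes dcgm-exporter's usual default port of 9400; the target hostname is a placeholder to replace with your GPU node:

```yaml
scrape_configs:
  - job_name: dcgm
    # dcgm-exporter typically serves /metrics on port 9400
    static_configs:
      - targets: ['gpu-node:9400']   # placeholder: your GPU host
```

From there, Grafana can chart the same tensor, SM, and DRAM metrics discussed above on a dashboard instead of a terminal.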

13:48.560 --> 13:54.880
data type, and how we can utilize it. There is also one more thing: if you have an AI workload

13:54.960 --> 14:00.640
and it is using FP32, it can often be converted to TensorFloat-32 (TF32). So, make sure that

14:00.640 --> 14:05.920
you identify that first, whenever you try to run those things, because it will improve the

14:05.920 --> 14:12.960
performance. So, as we discussed, memory bandwidth is the key thing here. So, having huge

14:12.960 --> 14:18.960
VRAM alone is not a good thing. Like, if you have a good GPU which has large VRAM, but its memory

14:18.960 --> 14:23.040
bandwidth is low, you are not getting much out of it. You are simply loading models and you are

14:23.120 --> 14:28.960
not executing them faster. So, that is one thing. So, compute is advancing way faster right now. So,

14:31.600 --> 14:38.960
yeah, that is it from my end.

