WEBVTT

00:00.000 --> 00:09.360
Can we kick off the last talk of the day?

00:09.360 --> 00:10.800
Good afternoon, everybody.

00:10.800 --> 00:11.800
Happy Sunday.

00:11.800 --> 00:15.040
We are finally at the end of FOSDEM, so I'll try not to lull you to sleep with the

00:15.040 --> 00:16.640
hardware talk.

00:16.640 --> 00:20.280
Wait, hardware talk.

00:20.280 --> 00:24.120
Or how I manage to sneak a hardware design talk into the HPC room.

00:24.120 --> 00:26.840
So hi, I'm Felix.

00:26.840 --> 00:27.840
Who's this bozo?

00:27.840 --> 00:34.400
I'm a platform engineer at AI Neco, I'm an open source guy, I work on HPC.Social, I'm a

00:34.400 --> 00:38.120
contributor at Chips and Cheese, we have stickers, they're great, go take them, you

00:38.120 --> 00:42.280
might notice the color scheme on the stickers is a fuchsia and a blue, which is also

00:42.280 --> 00:45.160
the theme of the background, which will hopefully work.

00:45.160 --> 00:50.400
So I've been referred to lovingly as a professional troublemaker, so I'm going to try to

00:50.400 --> 00:52.720
not cause too much trouble today.

00:52.720 --> 00:56.600
But the broad topic of what we're looking at is: what is the base element of

00:56.600 --> 00:58.440
how we do computations?

00:58.440 --> 01:01.040
What is the most basic building block?

01:01.040 --> 01:03.920
And what does it mean to say that a process is accelerated?

01:03.920 --> 01:09.080
When we're talking about hardware acceleration, there's this nebulous concept of, let's

01:09.080 --> 01:11.240
move this slide, what are we discussing?

01:11.240 --> 01:12.880
What defines an accelerator?

01:12.880 --> 01:15.880
What defines the workload that we're trying to accelerate?

01:15.880 --> 01:20.240
What makes that workload particularly good on a class of accelerator, let alone an

01:20.240 --> 01:24.960
implementation of that concept on an accelerator?

01:24.960 --> 01:28.760
And what is the design space that hardware people talk to software people about?

01:28.760 --> 01:33.720
And how do we go and define the platform, that interface, and that contract so people can

01:33.720 --> 01:37.640
make realistic usage of that hardware?

01:37.640 --> 01:39.760
What are we not going to be discussing?

01:39.760 --> 01:43.840
I'm not here to tell you that what you bought for your cluster was the wrong choice for

01:43.840 --> 01:47.440
that class of code, that's not what I'm here to say.

01:47.440 --> 01:51.640
I'm not here to tell you your device selection or the device selection of your users isn't

01:51.640 --> 01:52.840
reasonable.

01:52.840 --> 01:56.680
There are plenty of reasons to choose a particular class of accelerator, be that the

01:56.680 --> 02:01.200
interfaces, the ergonomics, the completeness of the software stack, or that it does a very

02:01.200 --> 02:05.640
particular operation that is right for you very, very well.

02:05.640 --> 02:10.400
Classical example of this, everyone hated AMD CPUs back in the bulldozer days.

02:10.400 --> 02:12.720
But what did bulldozer do?

02:12.720 --> 02:15.920
Bulldozer had FMA before anyone on the Intel side did.

02:15.920 --> 02:20.360
So for certain classes of clusters, yeah, my single threaded code is slower, but all of

02:20.360 --> 02:25.720
my FMAs, which for those classes of HPC codes were FMA bound, I'm literally twice as fast

02:25.720 --> 02:29.600
per clock in the same area for less money and the same amount of power.

02:29.600 --> 02:36.480
So implementations matter; what we accelerate in our operations matters.
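The Bulldozer point above can be sketched in a few lines of Python (illustrative only; the `dot` helper is a made-up name, not from the talk): a fused multiply-add computes a*b + c as one operation with a single rounding, and the FMA-bound HPC kernels the speaker mentions are essentially chains of exactly this.

```python
# A dot product is nothing but a chain of multiply-adds; hardware with
# FMA retires each "x * y + acc" as a single instruction, which is why
# FMA-bound codes ran twice as fast per clock on Bulldozer.

def dot(xs, ys):
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = x * y + acc  # one FMA per element on hardware that has it
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```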

02:36.480 --> 02:42.760
I'm going to contend that everything we use today is an accelerator of some form and we're

02:42.760 --> 02:43.920
going to get to what that means.

02:43.920 --> 02:46.960
But I think you're going to start to understand what we're actually doing here.

02:46.960 --> 02:55.120
So chapter one, the humble, simple CPU, what is the CPU?

02:55.120 --> 02:56.680
The basic building blocks.

02:56.680 --> 02:57.920
A CPU is very simple.

02:57.920 --> 03:01.520
A lot of people here who took one or two electrical engineering courses in school or

03:01.520 --> 03:06.600
who have an interest in this sort of thing will think of a CPU as I have a memory somewhere.

03:06.600 --> 03:11.120
That can be DDR, that can be SRAM, it doesn't matter; memory exists.

03:11.120 --> 03:14.960
In that memory is data, I would like to operate on that data.

03:14.960 --> 03:20.880
I need to define a contract for me to tell the machine how it operates

03:20.880 --> 03:22.480
on that data.

03:22.480 --> 03:26.640
So you fetch an instruction, that's me telling the hardware what it needs to do.

03:26.640 --> 03:31.800
That hardware needs to interpret that instruction, I decode it, and then I execute it.

03:31.800 --> 03:34.920
You tell me what you want me to do, I figure out how to do it, and then I do it.

03:34.920 --> 03:38.600
And then I give you your results back, that is the interface, that is the question.

03:38.600 --> 03:41.280
And all computers, at the baseline, are doing this.
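The fetch, decode, execute loop just described can be sketched as a toy interpreter in Python (purely illustrative; the instruction names and register names are invented for this sketch):

```python
# A toy fetch-decode-execute loop: "program" stands in for instruction
# memory, the loop fetches one instruction, decodes its opcode, and
# executes it against a tiny register file.

def run(program):
    regs = {"r0": 0, "r1": 0}
    pc = 0
    while pc < len(program):
        op, dst, val = program[pc]   # fetch
        if op == "load":             # decode + execute
            regs[dst] = val
        elif op == "add":
            regs[dst] += val
        pc += 1
    return regs                      # results handed back: the interface

print(run([("load", "r0", 5), ("add", "r0", 3)]))  # {'r0': 8, 'r1': 0}
```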

03:41.280 --> 03:45.840
This is the only way to implement anything in computer engineering, computer science, and

03:45.840 --> 03:51.560
HPC: anything that is not accelerated is one of these simple, in-order, fixed-function

03:51.560 --> 03:54.800
devices.

03:54.800 --> 03:58.200
The basic building block, we have a register file.

03:58.200 --> 04:00.880
That register file is what we operate on.

04:00.880 --> 04:03.960
We also have an ALU, an arithmetic logic unit.

04:03.960 --> 04:10.400
That unit takes the data in the register file, operates on it.

04:10.440 --> 04:11.400
The fuck is an ALU?

04:11.400 --> 04:14.960
Wait, I should probably not do that.

04:14.960 --> 04:19.960
So fundamentally, what are the basic logical operations that anything and everything in

04:19.960 --> 04:22.280
computer science can boil down to?

04:22.280 --> 04:26.560
We need to be able to do logical manipulations, shifts, these sort of ops.

04:26.560 --> 04:31.720
And as we do those, we can actually reconstruct every possible operation; any instruction

04:31.720 --> 04:36.840
or any command can be broken down to those fundamental operations.
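As a concrete illustration of that claim (a sketch, not from the talk): even integer addition can be rebuilt from nothing but the logical manipulations and shifts the speaker lists, XOR for the partial sum, AND plus a shift for the carries.

```python
# Addition reconstructed from pure logic ops, the way an ALU's adder
# works at gate level: XOR gives the sum bits, AND + shift gives the
# carries; repeat until no carry remains. Masked to 8 bits like a tiny
# fixed-width register.

def add_u8(a, b):
    while b:
        carry = (a & b) << 1
        a = (a ^ b) & 0xFF
        b = carry & 0xFF
    return a

print(add_u8(23, 42))  # 65
```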

04:36.840 --> 04:42.280
But if I had to do shifts and alignment and so on, every time I wanted to do an FMA,

04:42.280 --> 04:44.080
I'd have a very bad time.

04:44.080 --> 04:48.640
If I went to market with that device, that would also give me a very, very bad time,

04:48.640 --> 04:49.880
because it's going to be slow.

04:49.880 --> 04:53.240
On each instruction I need to check: did it do the correct movement?

04:53.240 --> 04:54.240
Am I aligned?

04:54.240 --> 04:57.760
I need to do the detection for, are my floating point bits aligned?

04:57.760 --> 04:58.760
Am I in the right rounding mode?

04:58.760 --> 05:02.520
Oh, now I need a subroutine that goes through and checks rounding modes, and then changes

05:02.520 --> 05:04.600
the implied state of the rest of the machine.

05:04.600 --> 05:09.480
Having every single portion of that independently and one at a time, that sounds really

05:09.480 --> 05:10.480
slow.

05:10.480 --> 05:15.840
And what do we do when we want something to not be slow? We accelerate it.

05:15.840 --> 05:18.120
And that's the whole point here.

05:18.120 --> 05:21.920
You can do anything you want with just integers.

05:21.920 --> 05:25.800
All the math fundamentally breaks down to integers, be they signed or unsigned, one

05:25.800 --> 05:27.400
way or another.

05:27.400 --> 05:29.720
But we don't always want to operate on integers.

05:29.720 --> 05:31.000
Sometimes we do.

05:31.000 --> 05:32.000
Sometimes we don't.

05:32.000 --> 05:36.000
Classes of codes will want to operate on them.

05:36.000 --> 05:38.120
And that's the point.

05:38.120 --> 05:39.120
This isn't easy.

05:39.120 --> 05:41.120
It's also not easy in the pure software sense.

05:41.120 --> 05:43.120
This is a different type of easy.

05:43.120 --> 05:44.440
It's not fast.

05:44.440 --> 05:45.640
And we want things to be fast.

05:45.640 --> 05:48.800
We call ourselves high-performance computing people, after all.

05:48.800 --> 05:54.240
So the first accelerator, if I haven't laid the foundation clearly enough, the first accelerator,

05:54.240 --> 05:57.200
I contend, was the floating point unit.

05:57.200 --> 06:00.200
I'm not going to get into the only floating point standard that matters.

06:00.200 --> 06:02.960
IEEE 754, hallowed be its name.

06:02.960 --> 06:07.880
But that being the case, what we're trying to do when we talk about floating point or

06:07.880 --> 06:12.520
other types of operations is I don't want to always be dealing with the minutiae of

06:12.520 --> 06:13.840
the machine.

06:13.840 --> 06:17.440
I want the machine to handle some of that complexity for me.

06:17.440 --> 06:21.720
I want to establish a contract with that machine that so long as I follow the rules

06:21.720 --> 06:25.120
of the format, the rules of the input.

06:25.120 --> 06:29.560
It can make certain classes of operations significantly faster.

06:29.560 --> 06:34.320
At the same time, if I break that contract and give something that's malformed to that

06:34.320 --> 06:38.560
unit, it's going to accelerate garbage, and I'm going to get garbage back, and I will

06:38.560 --> 06:41.960
deserve to get garbage back because I broke the contract.
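The garbage-in, garbage-out contract is easy to see with IEEE 754 itself (a small sketch, not from the talk): once a NaN enters the floating point unit, it happily "accelerates garbage" and every operation just propagates it.

```python
import math

# Break the input contract (hand the FPU a NaN) and the hardware keeps
# its side of the bargain anyway: fast, well-defined garbage. NaN flows
# through arithmetic and, per IEEE 754, compares unequal even to itself.

x = float("nan")
y = (x * 2.0 + 1.0) / 3.0

print(math.isnan(y))   # True: garbage in, garbage out
print(x == x)          # False: NaN is not equal to anything, including itself
```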

06:41.960 --> 06:43.960
That's how we operate.

06:43.960 --> 06:49.040
So we trade the generality for fixed function, fixed results.

06:49.040 --> 06:53.240
And as long as the implementer also honors their half of the contract, I get that back.

06:53.240 --> 06:58.720
I can trust the machine to give me what I told it I wanted back.

06:58.720 --> 07:00.720
Is that okay?

07:00.720 --> 07:04.720
Do you care about how the math gets done in the computer, so long as you get the correct

07:04.720 --> 07:05.720
results?

07:05.720 --> 07:08.720
I'm already down five minutes, good Lord.

07:08.720 --> 07:10.160
Of course we do.

07:10.160 --> 07:12.360
We have to care about how the machine operates.

07:12.360 --> 07:16.720
We have to care about what the machine gives back, and we have to care about the performance.

07:16.720 --> 07:19.720
Because the goal is not for me to come here and talk about computer architecture and so

07:19.720 --> 07:20.720
on.

07:20.720 --> 07:22.360
The goal is we have users.

07:22.360 --> 07:24.120
Those users use those machines.

07:24.120 --> 07:27.040
Those users have to operate on the machines, but they don't do it for the sake of doing

07:27.040 --> 07:28.040
it.

07:28.040 --> 07:29.040
They have a goal.

07:29.040 --> 07:32.040
They're doing math, not for the sake of doing math as fun as that is.

07:32.040 --> 07:34.040
They're doing math because they're looking to create a result.

07:34.040 --> 07:37.760
Be that a paper, be that a simulation, be that verifying that a car, when it crumples

07:37.760 --> 07:40.600
on impact, is not going to kill you.

07:40.600 --> 07:42.120
There are honest goals for how we do it.

07:42.120 --> 07:47.880
But the way we approach these problems is we have to give people tooling.

07:47.880 --> 07:50.360
So let's get a little bit more complicated.

07:50.360 --> 07:51.360
Simple three stage.

07:51.360 --> 07:53.120
You fetch, you decode, you execute.

07:53.120 --> 07:55.040
Everything we do is a version of this model.

07:55.040 --> 07:58.000
Just simply how big we allow it to be.

07:58.000 --> 08:01.280
You'll notice I did a little bit of sleight of hand.

08:01.280 --> 08:06.960
Now on this little diagram, I added decode, but in the decode phase, now I have two

08:06.960 --> 08:11.960
different options and I have an integer side and I have a floating point side because

08:11.960 --> 08:12.960
I have to.

08:12.960 --> 08:17.520
If I choose to have an accelerator, that accelerator is fixed function.

08:17.520 --> 08:21.000
But the machine has to be general, the machine must be able to operate on everything

08:21.000 --> 08:22.000
I give it.

08:22.000 --> 08:26.720
So that being the case, I have to be able to decode, mux, and send different data around to

08:26.720 --> 08:28.200
different devices.

08:28.200 --> 08:31.200
And when that's all in order, that's fine.

08:31.200 --> 08:35.160
But you're not always in order, or at least the machine isn't; it just might not be telling

08:35.160 --> 08:36.160
you about that.

08:36.160 --> 08:40.000
So let's get to the not so simple CPU.

08:40.000 --> 08:42.000
The not so basic building blocks.

08:42.000 --> 08:43.000
Super scalar.

08:43.000 --> 08:44.000
What does that mean?

08:44.000 --> 08:47.400
You have scalar units, but you have a superset of them.

08:47.400 --> 08:48.880
It's how I like to think about it.

08:48.880 --> 08:54.560
We have multiple ways of issuing data, of issuing instructions and where they go.

08:54.560 --> 08:57.760
Why do we care, though, and what are the nuances of this machine?

08:57.760 --> 09:03.520
When we think about the basic building blocks, doing a lot of certain functions in hardware,

09:03.520 --> 09:05.640
some are going to take a lot longer than others.

09:05.640 --> 09:07.000
Some require more bits.

09:07.000 --> 09:10.120
Think of an FP32 versus an FP64 FMA.

09:10.120 --> 09:14.960
Well, in hardware, in silicon, those required different numbers of gates.

09:14.960 --> 09:19.000
If they require different numbers of gates, and I'm pushing the clock speed up, it

09:19.000 --> 09:24.840
takes time to move electrons, to move current, to move voltage across the machine.

09:24.840 --> 09:29.720
And if the area is bigger, then I have to decide, I'm telling you, as a contract, you

09:29.720 --> 09:31.600
will get the correct answer.

09:31.600 --> 09:32.600
You always get that.

09:32.600 --> 09:37.360
I do not necessarily then have to guarantee when you will get that answer back.

09:37.360 --> 09:39.440
So long as the guarantee is, you will get that answer.

09:39.440 --> 09:41.240
And I do make that promise to you.

09:41.240 --> 09:46.120
If you don't, then it's the halting problem, and that's pain.

09:46.120 --> 09:49.640
So not all these instructions take the same amount of time.

09:49.640 --> 09:52.720
So now, the not so simple version.

09:52.720 --> 09:57.640
As we go and we build bigger and bigger devices, bigger and bigger CPUs, we've made the

09:57.640 --> 10:01.000
decisions that certain operations must be very fast.

10:01.000 --> 10:05.400
Certain operations must be fast from the context of latency.

10:05.400 --> 10:09.640
Certain operations must be fast from the context of throughput.

10:09.640 --> 10:13.160
And they don't always have to go together, and they don't all have to happen at the same

10:13.160 --> 10:14.160
time.

10:14.160 --> 10:18.480
It's much like us running HPC centers, running supercomputers, running different workloads.

10:18.480 --> 10:23.240
Not everyone is doing the same math, and it's not always structured to be the same math.

10:23.240 --> 10:28.040
But we're trying to build a machine that can accelerate a large class of problems.

10:28.040 --> 10:33.000
So now the same way that I had fixed function hardware versus general hardware, I have fixed

10:33.000 --> 10:35.440
function and general acceleration.

10:35.440 --> 10:38.520
The general version of the acceleration versus the fixed version of it.

10:38.520 --> 10:39.760
How do I approach that problem?

10:39.760 --> 10:44.640
How do I build hardware and contracts to be able to do that?

10:44.640 --> 10:47.840
And then this is sort of what you look like when you're building a really big CPU these

10:47.840 --> 10:48.840
days.

10:48.840 --> 10:49.840
This is one core.

10:49.840 --> 10:53.960
When you go and you build and you design a single CPU core, that one thread that you see,

10:53.960 --> 11:01.280
they can be doing 8, 10, 12; there are designs trying to go to 16 and 24, which is insanity.

11:01.280 --> 11:05.360
But they're parallel decoding all of these instructions, all at once.

11:05.360 --> 11:08.680
And then they look the same way that you would in a compiler and say, what is my dependency

11:08.680 --> 11:10.000
chain?

11:10.000 --> 11:14.920
Because the same way that everything looks to be happening at the same time, nothing

11:14.920 --> 11:17.080
is happening at the same time.

11:17.080 --> 11:21.880
When I go and I say, I need you to load this, modify it and give it back to me.

11:21.880 --> 11:24.800
I don't know when that memory is going to arrive.

11:24.800 --> 11:28.640
Because on a laptop, my clock speed is changing because I want to save you power so you can

11:28.640 --> 11:29.920
get more work done.

11:29.920 --> 11:34.640
But that means now the clock distance, the effective distance from the core to the memory,

11:34.640 --> 11:35.640
that's changing on me.

11:35.720 --> 11:39.800
So I don't know the fixed time when data is going to arrive.

11:39.800 --> 11:42.040
But I can take advantage of that actually.

11:42.040 --> 11:48.040
Because now on the power side, I know that certain hardware takes longer to wake up.

11:48.040 --> 11:54.680
And I know when I can combine instructions together, I can fuse different dependencies together.

11:54.680 --> 11:57.760
So I can look and say there's a bunch of different vector operations.

11:57.760 --> 12:00.760
And I want to schedule them all, they don't have dependencies.

12:00.760 --> 12:04.080
But I'm going to go wait until I buffer them together.

12:04.080 --> 12:08.960
Then I kick on my vector unit, do all of them at once, then bring it back and shut down the

12:08.960 --> 12:09.960
vector unit.

12:09.960 --> 12:13.240
And now I'm saving you tens of watts all the time.

12:13.240 --> 12:18.800
I'm saving you a ton of power and we're reordering everything at the same time.

12:18.800 --> 12:21.640
So how do you need to think about these CPU cores?

12:21.640 --> 12:24.840
You'll notice on that last slide.

12:24.840 --> 12:27.920
Notice how I have to split the device into two, right?

12:27.920 --> 12:30.320
This is a really, really deep pipeline.

12:30.320 --> 12:31.760
We have front ends and back ends.

12:31.760 --> 12:35.040
They execute different things at different times.

12:35.040 --> 12:38.320
But that still means back to the size of units.

12:38.320 --> 12:41.400
Some units are going to be significantly larger than others.

12:41.400 --> 12:43.240
And we need to move the data.

12:43.240 --> 12:45.280
We take time to do that.

12:45.280 --> 12:51.440
Which means now I have to amortize as much latency as I can, because the user perceives

12:51.440 --> 12:55.760
that it's going to take no time at all and that everything happens in order.

12:55.760 --> 12:59.440
But because of the amount of latency it's going to take to get from the top of the core

12:59.440 --> 13:03.760
all the way to the bottom, now I have to worry about, okay, how can I hide that?

13:03.760 --> 13:08.360
The very classic trick is I try to pre-fetch data and I'm pre-fetching data and I'm putting

13:08.360 --> 13:10.000
that in the branch predictor.

13:10.000 --> 13:14.200
Oh, crazy word there, branch predictor.

13:14.200 --> 13:19.520
So I'm assuming that the machine knows the way I'm going to fork better than I do.

13:19.520 --> 13:21.560
And the truth is, it kind of does.

13:21.560 --> 13:25.320
The basic example of a branch predictor where you get tons of performance and where you can

13:25.320 --> 13:28.280
amortize that latency, amortize those loads.

13:28.280 --> 13:29.600
Is it a loop?

13:29.600 --> 13:30.600
What is a loop?

13:30.600 --> 13:33.400
A loop is saying I'm starting this block of code.

13:33.400 --> 13:34.920
I'm checking a condition.

13:34.920 --> 13:38.680
If that condition hasn't happened, I'll go back to the top and I'll continue.

13:38.680 --> 13:40.320
I'm not going to break out of the loop.

13:40.320 --> 13:42.640
Okay, a very simple mental model then.

13:42.640 --> 13:44.520
How big are most people's loops?

13:44.520 --> 13:48.280
I'm going to argue that most loops are bigger than two iterations.

13:48.280 --> 13:51.680
And because of that, you can very quickly determine, heuristically:

13:51.680 --> 13:57.600
Oh, because I assume that in the vast majority of cases, I'm not going to break out

13:57.600 --> 14:03.040
of the loop while I'm executing loop iteration one, I can go through in pipeline and do

14:03.040 --> 14:04.960
the fetch for two.

14:04.960 --> 14:09.240
But wait, if I'm already doing that, why don't I pre-fetch data for loop two?

14:09.240 --> 14:11.120
Loop three, loop four.

14:11.120 --> 14:15.440
And now I have this massive branch predicted machine.
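The loop heuristic the speaker just walked through can be sketched as a toy predictor (illustrative only; the function name is invented): if you always predict the loop-back branch is taken, you are wrong exactly once, on the final exit, no matter how many iterations run.

```python
# A toy "always predict taken" branch predictor applied to a loop's
# back-edge: for N iterations the branch is taken N-1 times and not
# taken once, so this predictor mispredicts exactly once. That is why
# loop branches are nearly free on real hardware.

def mispredicts_always_taken(iterations):
    wrong = 0
    for i in range(iterations):
        taken = i < iterations - 1   # loop-back branch: taken until the last pass
        if not taken:                # our "taken" prediction was wrong here
            wrong += 1
    return wrong

print(mispredicts_always_taken(1000))  # 1
```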

14:15.440 --> 14:20.480
But is the code I'm running based on branches?

14:20.480 --> 14:24.480
So now we get into different classes of code, the mental model for how you approach device

14:24.480 --> 14:26.120
selection.

14:26.120 --> 14:29.240
When your code and the problem, the math and the physics and the chemistry and

14:29.240 --> 14:34.680
the biology, and so many other problems are throughput optimized.

14:34.680 --> 14:41.040
It's the same operation, day in, day out, vertical lanes, no temporal or spatial dependencies.

14:41.040 --> 14:44.120
It's very easy for me to do the mapping of that problem.

14:44.120 --> 14:48.160
To, oh, I don't need all of that crazy branch prediction machinery.

14:48.160 --> 14:53.360
I would rather something that is naively parallel, massively wide, which we're going to get

14:53.360 --> 14:54.600
into in just a second.

14:54.600 --> 14:57.240
I'm going to run out of time and I'm going to be in trouble.

14:57.240 --> 15:03.560
So moving on, the not so simple building blocks.

15:03.560 --> 15:09.000
When you look at the modern core, not only is it going to memory, but I said pre-fetch.

15:09.000 --> 15:12.960
Pre-fetch means the data needs to be fetched from somewhere and put somewhere else.

15:12.960 --> 15:17.840
But because we're dealing with electrical components, we're dealing with real computers.

15:18.200 --> 15:22.880
The distance from one place to another is going to change.

15:22.880 --> 15:25.880
That distance can change and I don't know where it is.

15:25.880 --> 15:31.360
So when I look at the point of view from the core, my data is in my core or it's a level

15:31.360 --> 15:32.640
above me in a cache.

15:32.640 --> 15:33.880
Why do I cache?

15:33.880 --> 15:37.840
Because I hope that my data will be reused multiple times.

15:37.840 --> 15:41.520
And if that's the case, well, maybe one core needs to reuse data all the time.

15:41.520 --> 15:44.600
Think of constants in the lowest level of loop.
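The reuse argument can be made concrete with a toy cache model (a sketch under invented names, not from the talk): a loop rereading the same constant misses once to bring it in, and every access after that is a hit.

```python
# A toy direct-mapped cache: data is kept close on the hope of reuse.
# A loop rereading one constant pays the "level above me" cost exactly
# once and hits locally ever after, which is the whole point of caching.

class ToyCache:
    def __init__(self, lines=4):
        self.tags = [None] * lines
        self.lines = lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        idx = address % self.lines        # direct-mapped: one slot per address
        if self.tags[idx] == address:
            self.hits += 1                # data was reused
        else:
            self.misses += 1              # go a level up, then keep a copy
            self.tags[idx] = address

cache = ToyCache()
for _ in range(10):                       # a loop rereading the same constant
    cache.access(0x40)
print(cache.hits, cache.misses)           # 9 1
```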

15:44.600 --> 15:49.160
But in other matrices and other problems, you're going to get cases where the core wants

15:49.160 --> 15:51.440
to share that data with other cores.

15:51.440 --> 15:53.560
So it has to have a shared cache.

15:53.560 --> 15:57.440
But now if I'm sharing data, one core can modify something.

15:57.440 --> 16:00.360
And the other core doesn't necessarily know I've modified it.

16:00.360 --> 16:04.800
So now I need to build additional machinery to go and make sure that everything is coherent.

16:04.800 --> 16:09.440
That everyone sees the same place in time, even though we're never at the same place

16:09.440 --> 16:10.880
in time in the machine.

16:10.880 --> 16:13.960
And you do that across levels.

16:14.000 --> 16:18.400
This is just a quick example of what a core looks like at the simplistic level.

16:18.400 --> 16:23.400
All the cores can be connected in a mesh, a torus, a hypercube; there are 101 different

16:23.400 --> 16:25.200
ways to do this.

16:25.200 --> 16:29.040
And this gets back to, when I'm thinking about the use case, what is a CPU designed

16:29.040 --> 16:30.040
for?

16:30.040 --> 16:35.800
A CPU's primary use case is scalar operations that can be branch predicted.

16:35.800 --> 16:40.080
If that maps really well to your code, then you're in really good shape.

16:40.080 --> 16:43.000
But it turns out that's not always what you're working on.

16:43.000 --> 16:46.200
So let's look at a very simple GPU.

16:46.200 --> 16:48.800
GPU: graphics processing unit. But that's actually very boring.

16:48.800 --> 16:51.200
And that's not what we use them for anymore.

16:51.200 --> 16:56.000
What we use them for is massive vector programs.

16:56.000 --> 17:02.400
When we're distributing tons of data in those examples of horizontal threads, with CPUs

17:02.400 --> 17:08.800
we are dealing with cases where data happens in a pipeline within that temporal level,

17:08.800 --> 17:10.760
and you have dependencies within each other.

17:10.760 --> 17:13.760
GPUs are looking at the world of: no, you're just doing the same thing, and that comes

17:13.760 --> 17:14.760
from their origin.

17:14.760 --> 17:16.760
They're looking at pixel programs.

17:16.760 --> 17:21.880
When you think of pixel programs, I'm just doing, effectively, ax plus b on a single pixel.

17:21.880 --> 17:24.520
I don't need to care what the rest of the screen is doing.

17:24.520 --> 17:29.120
I only need to care about what I'm doing, so I can build very simple, fixed function hardware

17:29.120 --> 17:31.200
for how that's going to operate.

17:31.200 --> 17:34.920
And then, okay, GPU's got more complicated because we started doing very pretty things

17:34.920 --> 17:39.680
like anti-aliasing, anti-aliasing is a form of blurring, well, a blur means I'm doing an

17:39.680 --> 17:41.680
average, I'm doing a local average.

17:41.680 --> 17:46.080
So now I need some form of connection between all the fixed function hardware, but I never

17:46.080 --> 17:51.120
need to worry about the case where the data is global, for the most part, asterisk, two

17:51.120 --> 17:53.640
hour talk goes here.

17:53.640 --> 17:57.240
So when you have that type of connection, that origin, you're still looking at it as: this

17:57.240 --> 18:01.120
is all vertical and non-dependent data, and I can build hardware for that, and that's

18:01.120 --> 18:03.320
why we do see hardware built for that.

18:03.320 --> 18:05.480
But you can't always trust that that's what your problem space is going to look

18:05.480 --> 18:06.480
like.

18:07.400 --> 18:13.040
And I worry about very high performance branch prediction, where conditions, where changes

18:13.040 --> 18:18.160
happen are a problem, or fine, versus the other half of the world, the other extreme

18:18.160 --> 18:23.680
GPUs, where my hardware is so simple, I don't have time to deal with branches.

18:23.680 --> 18:28.840
And my assumption, the contract that I'm building with you, the user of that GPU, is that

18:28.840 --> 18:32.520
you are promising me, that to reach your performance, you will write codes, you will

18:32.520 --> 18:37.320
develop algorithms, such that you minimize the dependencies on different places.

18:37.320 --> 18:43.840
Throughput optimized versus branchy, latency optimized: that's the mental model for how to do that.

18:43.840 --> 18:48.360
This is very simple as a pipeline, you see you have an instruction cache, you're executing

18:48.360 --> 18:51.960
those instructions, and you're also assuming that those instructions are going to be spread

18:51.960 --> 18:55.480
out across the entire machine everywhere.

18:55.480 --> 18:56.480
Great.

18:56.480 --> 19:00.960
Now I have a warp scheduler, and I'm dispatching, same idea as before, with a very simple CPU

19:01.040 --> 19:06.040
core, the warp is just a group of threads together, and I'm saying all those threads

19:06.040 --> 19:10.120
are going to be looking at the same data; that's where the term SIMT comes from, single

19:10.120 --> 19:12.840
instruction, multiple threads; that's how they operate together.
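The SIMT idea can be sketched in plain Python (illustrative only; the warp size and helper name are assumptions for this sketch): one instruction is applied in lockstep across every lane of a warp.

```python
# A sketch of the SIMT model: one instruction, applied across all 32
# lanes of a "warp" at once. The fixed warp width is the contract the
# hardware makes; keeping every lane busy is the contract you make back.

WARP_SIZE = 32

def warp_execute(instruction, lanes):
    assert len(lanes) == WARP_SIZE       # partial warps waste the machine
    return [instruction(x) for x in lanes]  # hardware does these in lockstep

result = warp_execute(lambda x: 2 * x + 1, list(range(WARP_SIZE)))
print(result[0], result[31])  # 1 63
```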

19:12.840 --> 19:17.520
I am assuming that you always have full usage of the full machine, and if you don't, your

19:17.520 --> 19:22.440
performance will be crap, your code will be bad, and you should feel bad.

19:22.440 --> 19:25.840
With that, you have the register file, well it's all local, but because I'm doing these

19:25.840 --> 19:30.760
massive, amortized, wide vectors, it's very easy for me to make a very wide throughput

19:30.760 --> 19:31.760
optimized device.

19:31.760 --> 19:35.960
I'm not worrying about loading a single case, because from the point of view of the thread,

19:35.960 --> 19:40.040
I'm loading a single case, but everything is grouped together, so I can actually turn

19:40.040 --> 19:44.600
that into a very wide load, so now everything comes together at once, and I can pipeline

19:44.600 --> 19:47.760
and match that to the size of the underlying machine.

19:47.760 --> 19:52.240
Similarly, I'm going to run out of time by a long shot.

19:52.240 --> 19:55.640
So how do you think about CPU cores together versus GPU cores?

19:55.640 --> 20:02.880
When you think of a GPU as being a massive group of connected GPU cores, you're using

20:02.880 --> 20:06.320
those very simple units that we saw at the very beginning, and instead of making the

20:06.320 --> 20:11.520
cores complicated, you're saying, as long as you promise to always give me very simple

20:11.520 --> 20:15.360
code, and a lot of it, I will be able to do that.

20:15.360 --> 20:18.520
But because I don't have the advanced machinery to be looking at different problems all

20:18.520 --> 20:23.640
at once, that also means that when a GPU has to deal with much more complicated machinery,

20:23.640 --> 20:28.280
more complicated control flow especially, it's going to fall flat on its face, and

20:28.280 --> 20:30.960
that's something that you opt into.

20:30.960 --> 20:33.720
What am I accelerating?

20:33.720 --> 20:38.040
I was going to try to do a high level overview of how you do a high performance GPU.

20:38.040 --> 20:39.480
That should have been a 45 minute talk.

20:39.480 --> 20:40.640
We're not going to do that.

20:40.640 --> 20:44.680
I am just going to say at the high level though, when you're looking at the hierarchy,

20:44.680 --> 20:47.960
notice how earlier I was talking about the CPU hierarchy and going through trees.

20:47.960 --> 20:53.600
Well, I'm doing the same thing in a GPU, and yet why does my performance look so different?

20:53.600 --> 20:57.760
It comes out to the assumptions I can make, but why does every hardware designer build

20:57.760 --> 21:01.760
these massive devices with massive amounts of memory, and then these cascading trees

21:01.760 --> 21:04.240
and these cascading hierarchies?

21:04.240 --> 21:06.440
This is a reality of silicon design.

21:06.440 --> 21:09.600
When we go and we implement your chips, we have IO, and we have compute.

21:09.600 --> 21:12.880
These are the two blocks that we effectively have to play with.

21:12.880 --> 21:18.480
But your IO, so that's your memory, your ethernet, your PHYs, your PCIe, all of that,

21:18.480 --> 21:24.440
scales with the perimeter of the chip, and then all the compute, all your ALUs, those SRAMs,

21:24.440 --> 21:28.240
those caches, those scale with the area of the chip.

21:28.240 --> 21:32.560
There's a whole aside here on lithographic design that's outside of the scope.

21:32.560 --> 21:36.520
What it means though is we start to see similar hierarchies because it's forced upon us

21:36.520 --> 21:39.720
by the reality of the hardware, the reality of the space.

21:39.720 --> 21:44.280
The choice becomes for the same amount of hardware, as we've scaled up hardware, we continue

21:44.280 --> 21:48.440
to have that bandwidth problem, that throughput problem, and the way we design the chip,

21:48.440 --> 21:50.880
has to reflect the realities of the device.

21:50.880 --> 21:54.440
Then choosing that device, or choosing the code and the algorithms and so on that you're going

21:54.440 --> 21:58.000
to map to the chip, that's what matters on the mental-model side.

21:58.000 --> 22:03.200
I am out of time, we're going to skip FPGAs, I'm going to do this for a second.

22:03.200 --> 22:09.320
I had "simple FPGA", that was a lie, no, okay, it is the last thought, I guess.

22:09.320 --> 22:12.640
So what is the machine solving for?

22:12.640 --> 22:15.640
As a programmer, how do you need to think about those GPU cores?

22:15.640 --> 22:18.760
You need to always be building this into your mental model.

22:18.760 --> 22:23.320
I can do things locally and they're fine, as long as they're within the group.

22:23.320 --> 22:30.760
These days, 95% of GPUs are organized in groups of 32; some GPUs support groups of 64.

22:30.760 --> 22:34.120
This is an implementation question, but the thing I need to be thinking about is, every

22:34.120 --> 22:38.880
time I go to do accesses, every time I go to distribute the data, you'll see groups of

22:38.880 --> 22:43.320
32, 32, 32, why am I doing that, why does that work that way?

22:43.320 --> 22:47.480
For one, the contract with the hardware is: every time you're going to amortize data accesses,

22:47.480 --> 22:51.720
I'm going to do those in groups of 32, or I'm going to do those in groups of 64.

22:51.720 --> 22:54.720
And that simplifies how the hardware is actually working.

22:54.720 --> 23:00.160
If it simplifies something, then I can throw more hardware, more time, more space, more energy

23:00.160 --> 23:02.760
at the portions of the machine that you actually care about.

23:02.760 --> 23:06.760
Instead of spending time on guarantees that were never made explicit, so you don't, as the

23:06.760 --> 23:13.200
user have to intuit how the machine is doing things under the hood.
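
A minimal sketch of that groups-of-32 contract (the warps_needed helper is hypothetical bookkeeping; real GPUs enforce this in the scheduler): work always issues in whole warps, so sizing your data to multiples of 32 is what keeps lanes from going to waste.

```python
# The hardware contract: threads issue in fixed-size groups (warps).
# A launch of N elements always costs ceil(N / 32) warps, so one stray
# element past a multiple of 32 occupies a nearly empty warp.

WARP = 32

def warps_needed(n_elements, warp=WARP):
    return (n_elements + warp - 1) // warp  # round up to whole warps

print(warps_needed(32))    # 1
print(warps_needed(33))    # 2: one element, one almost-idle warp
print(warps_needed(1000))  # 32
```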

23:13.240 --> 23:17.240
FPGAs: because I put it on my FOSDEM submission that I would talk

23:17.240 --> 23:19.480
about them, so let's talk about them.

23:19.480 --> 23:24.040
There is no such thing as a simple FPGA, I kind of skipped over that joke earlier, so let's

23:24.040 --> 23:27.960
talk about FPGAs more broadly.

23:27.960 --> 23:31.560
So what is the basic building block of an FPGA?

23:31.560 --> 23:32.560
What is an FPGA?

23:32.560 --> 23:35.280
Actually, that's a more interesting question.

23:35.280 --> 23:40.880
An FPGA, the acronym, because we love those, is a field-programmable gate array.

23:40.880 --> 23:43.480
Okay, let's do that backwards.

23:43.480 --> 23:46.480
I have an array, right?

23:46.480 --> 23:48.320
Everyone here knows what an array is.

23:48.320 --> 23:54.080
That array is mapped to a grid, okay.

23:54.080 --> 23:59.800
It's programmable, so you have an array that's organized as a grid, and it's programmable,

23:59.800 --> 24:00.800
okay, great.

24:00.800 --> 24:05.720
If you think about just those three terms, though, that doesn't sound that interesting,

24:05.720 --> 24:11.960
because you could say that any compute device is going to be a grid of arrays of compute

24:11.960 --> 24:14.400
elements, right?

24:14.400 --> 24:17.120
And obviously, we can program the device.

24:17.120 --> 24:21.600
So what makes FPGA so different and so interesting to so many people?

24:21.600 --> 24:26.280
Well, it's the field: what it means and what it implies in the contract with

24:26.280 --> 24:34.280
the user is I'm going to give you a massive set of resources and I'm going to make no assumptions

24:34.280 --> 24:37.360
about the type of programs you're going to run.

24:37.360 --> 24:44.000
I'm going to embrace true generality and I'm just going to connect everything together.

24:44.000 --> 24:49.200
And the field portion of that means that in the field, in your data center, on your robot,

24:49.200 --> 24:54.320
on your quadcopter, whatever you want, I can change the way devices are connected together,

24:54.320 --> 25:01.120
effectively through muxes, implementation details aside, I can change how devices are connected.

25:01.120 --> 25:06.680
That also means though that when I'm designing an algorithm, I have infinite freedom.

25:06.680 --> 25:11.840
I also have infinite complexity, nothing is handled for me.

25:11.840 --> 25:16.720
It also means that as the hardware designer, as I do that contract with you, I can make

25:16.720 --> 25:21.080
no reasonable assumption of what you're working on, which means all the tricks I do in

25:21.080 --> 25:25.680
normal architectures to make the device fast, like massive amortized load-stores as part of the

25:25.680 --> 25:26.680
contract, those go away.

25:27.000 --> 25:31.360
I have to assume that some portions are going to be extremely simple scalar loads, and

25:31.360 --> 25:33.640
some portions are going to be really complicated.

25:33.640 --> 25:37.360
I can make no assumptions about the direction everything is going to flow.

25:37.360 --> 25:42.400
I can't make an assumption on whether you're going to be big-endian or little-endian, and all these little

25:42.400 --> 25:45.280
tricks that the hardware people play, we can't use them.

25:45.280 --> 25:50.120
But we wouldn't use FPGAs at all if this entire concept wasn't usable.

25:50.120 --> 25:53.720
So why do we use them and where do they become interesting?

25:53.720 --> 25:58.160
They become interesting when you have a really weird wacky algorithm that represents

25:58.160 --> 26:03.960
a fundamental piece of physics that cannot be cleanly mapped to a GPU because you have massive

26:03.960 --> 26:05.200
dependencies.

26:05.200 --> 26:06.440
But it's also not branchy.

26:06.440 --> 26:08.720
I know the way the data is going to change.

26:08.720 --> 26:12.520
I just can't predict and amortize the way it's going to work, the way it's going to be

26:12.520 --> 26:13.520
used.

26:13.520 --> 26:18.880
And then in this case, you're the hardware programmer, the software programmer, and the

26:18.880 --> 26:22.880
platform person all in one; you have to worry about how all of these different devices are going to

26:22.880 --> 26:25.720
be connected.

26:25.720 --> 26:30.080
But it's not that simple because one of the things we decided a long time ago was, well,

26:30.080 --> 26:32.640
we don't know how you're going to connect devices together.

26:32.640 --> 26:37.000
We don't know exactly what algorithm you're going to do, but we know you want to do math,

26:37.000 --> 26:42.120
because everything we do is to do the math, just in weird wacky different ways.

26:42.120 --> 26:47.520
So what I can say is, I'll add a little bit of extra complexity and I'll add DSPs, digital

26:47.520 --> 26:50.920
signal processors, as an accelerator.

26:50.920 --> 26:57.920
I'll go through and say, broadly speaking, at the scale of the whole device, almost everything

26:57.920 --> 27:02.200
will be lookup tables and wires and muxes and so on to connect different portions.
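
The lookup table itself is simple enough to sketch. A 2-input LUT is a 4-entry truth-table memory, and "programming" the FPGA amounts to filling those tables (illustrative Python, not any vendor's toolchain; real LUTs are typically 4- to 6-input):

```python
# A LUT is a tiny memory addressed by its inputs: fill the table and you
# have implemented any boolean function of those inputs.

def make_lut2(truth_table):
    # truth_table[a*2 + b] holds the output for inputs (a, b)
    def lut(a, b):
        return truth_table[a * 2 + b]
    return lut

AND = make_lut2([0, 0, 0, 1])  # same hardware, different configuration bits
XOR = make_lut2([0, 1, 1, 0])
print(AND(1, 1), XOR(1, 1))  # 1 0
```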

27:02.200 --> 27:06.480
Because what we're worried about is not necessarily doing the math.

27:06.480 --> 27:10.160
But what we're worried about is how do I get the math where it needs to be?

27:10.160 --> 27:14.800
And when I'm doing my math, using tricks like DSPs, these can be much denser.

27:14.800 --> 27:18.800
These are much easier to design and these have a contract, so long as you're doing floats

27:18.800 --> 27:23.800
and not posits and so on, you can go through and pipeline and connect these devices directly

27:23.800 --> 27:28.360
together and flow the data spatially where it needs to be.
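
That pipelined, spatial flow can be sketched with generators standing in for chained DSP blocks (an analogy only, the stage names here are made up): each stage's output streams straight into the next, with no round trip through shared memory.

```python
# Spatial dataflow sketch: results stream stage-to-stage, the way pipelined
# DSP blocks on an FPGA pass data directly over dedicated routing.

def scale(stream, k):
    for x in stream:
        yield x * k      # stage 1: a multiplier block

def offset(stream, c):
    for x in stream:
        yield x + c      # stage 2: an adder block wired to stage 1

pipeline = offset(scale(range(4), 3), 1)  # data flows through once
print(list(pipeline))  # [1, 4, 7, 10]
```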

27:28.360 --> 27:30.840
But why is everyone doing this?

27:30.840 --> 27:35.440
Because HPC, diversity of workloads, when we're building clusters, we have 101 different

27:35.440 --> 27:39.600
problems on a good day, realistically it's more like 1,001, and you look at the

27:39.600 --> 27:40.600
diversity of users.

27:40.600 --> 27:41.880
Well, what do they want?

27:41.880 --> 27:43.880
Are they doing small cases or big cases?

27:43.880 --> 27:47.920
The way I design a small cluster versus a big cluster: what's the topology of the

20:47.920 --> 20:49.960
network in the cluster that they want to use?

27:49.960 --> 27:54.160
Some people want dragonflies, some people want toruses, some people want hypercubes, fat trees,

27:54.160 --> 27:55.160
everything in between.

27:55.160 --> 27:57.560
I don't know what you're going to do.

27:57.560 --> 28:02.240
So the FPGA sounds perfect: I give you, the user, infinite control, but then I remind you

28:02.240 --> 28:05.800
you have papers to publish and you don't have infinite time.

28:05.800 --> 28:11.040
And I remind you that when I'm looking at the different work, I don't know what's coming.

28:11.040 --> 28:15.800
And it's more important to get the work done than it is to get it done perfectly

28:15.800 --> 28:20.400
a lot of the time because the goal is advancing the science, not ending the science.

28:20.400 --> 28:23.320
And we make small iterative approaches to be better.

28:23.320 --> 28:27.520
Some people will scoff at that philosophically and yeah, we can get into it.

28:27.520 --> 28:30.200
So what is the machine solving for?

28:30.200 --> 28:32.360
How do you need to think about an FPGA?

28:32.360 --> 28:34.760
How do you need to think about the elements on it?

28:34.760 --> 28:40.240
You're looking at distributed operations, distributed pieces of compute, DSPs, and connections

28:40.240 --> 28:43.880
and networks that bring the data together.

28:43.880 --> 28:48.440
But because of that extreme generality that's presented to each and every user, I'm going

28:48.440 --> 28:51.800
to take more area, all those wires need to take up space.

28:51.800 --> 28:53.600
I'm going to burn more power.

28:53.600 --> 28:55.600
And it becomes an optimization problem.

28:55.600 --> 29:00.000
Do I have the time and do I have something that is representable in such a way that it

29:00.000 --> 29:04.880
makes more sense to implement something on an FPGA than to go and use something bespoke?

29:04.880 --> 29:08.960
At the same time, you run into unique problems, especially in, say, the defense space, where

29:08.960 --> 29:13.000
I have a very specific application that I need to accelerate.

29:13.000 --> 29:19.080
But I don't have a hundred million dollars to call up Intel or TSMC or GlobalFoundries

29:19.080 --> 29:21.600
or Samsung to go and build a new chip.

29:21.600 --> 29:22.960
So this is a way to implement that.

29:22.960 --> 29:26.760
So across all these classes of problems, you have different hardware that is bespoke

29:26.760 --> 29:29.920
for the use case, different ways of approaching that problem space.

29:29.920 --> 29:33.640
And the truth for what's optimal in your use case is somewhere in between.

29:33.640 --> 29:41.240
And yet, our humble CPU, what am I trying to solve for?

29:41.240 --> 29:45.360
I'm trying to solve for making sure that all your users are always taken care of.

29:45.360 --> 29:49.920
And the whole point of this talk is if you know what your users are working on, and you

29:49.920 --> 29:53.880
can map the time that they're going to be doing their computations, the underlying math

29:53.880 --> 29:57.560
to a specific device, you can and you should.

29:57.560 --> 30:00.360
But at the end of the day, you're also going to have a CPU.

30:00.360 --> 30:04.040
You're going to have general use cases that cannot be accelerated.

30:04.040 --> 30:08.080
You're going to have cases where it's going to be a simple pipeline that cannot be predicted.

30:08.080 --> 30:09.720
It has to be scalar.

30:09.720 --> 30:13.360
And for these problems, you're going to always need some sort of base device.

30:13.360 --> 30:17.320
Not every problem can be accelerated, but the level of acceleration changes based on the

30:17.320 --> 30:18.800
contracts you can make.

30:18.800 --> 30:19.800
That's the talk.

30:19.800 --> 30:20.800
Thanks.

30:20.800 --> 30:21.800
Thanks.

30:21.800 --> 30:33.320
Thanks a lot, Felix. We're a bit over time, but there are no other speakers, so

30:33.320 --> 30:35.320
we can take some questions.

30:35.400 --> 30:38.840
Does anyone have questions for Felix?

30:38.840 --> 30:44.920
Out of curiosity, how does hyper-threading fit in this picture?

30:44.920 --> 30:46.920
That's an implementation trick.

30:46.920 --> 30:47.920
Yeah, sir.

30:47.920 --> 30:49.880
So where does hyper-threading fit in?

30:49.880 --> 30:55.400
So let's go to our humble CPU.

30:55.400 --> 31:00.560
When I look here, let's look at this one.

31:00.560 --> 31:02.400
Very simple.

31:02.400 --> 31:09.440
What hyper-threading is saying is to the programmer, I can have visibility of two things

31:09.440 --> 31:11.320
happening at once.

31:11.320 --> 31:15.560
But no two things are ever truly happening at once in the same core.

31:15.560 --> 31:20.520
The idea of hyper-threading is when one resource is locked up, the next thread can make

31:20.520 --> 31:22.400
forward progress.

31:22.400 --> 31:27.360
It only works when you have multiple dispatch units, multiple execution units, multiple

31:27.360 --> 31:30.240
floating-point adders and integer and vector units.

31:30.240 --> 31:34.640
So what I'm doing is, back to the prefetcher discussion,

31:34.640 --> 31:39.240
I'm waiting on data for one thread, while the other thread's data may be arriving.

31:39.240 --> 31:44.720
Because that data has arrived, I can make forward progress on that thread.

31:44.720 --> 31:48.640
So basically, in a lot of cases, there are complexity and visibility problems.

31:48.640 --> 31:51.640
We don't want another Spectre or Meltdown.

31:51.640 --> 31:56.600
But if you're clever about the way you build the hardware, I basically can double the amount

31:56.600 --> 32:02.120
of threads I have for free, assuming that those threads are not contending for the same

32:02.120 --> 32:03.480
resources all the time.
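
A toy model of that idea (the run helper and its latencies are made up, not a real pipeline model): two hardware threads share one core's issue slot, and when one stalls on memory, the other's ready work issues instead.

```python
# Toy SMT model: one core, N threads. Each thread is a list of
# ("work", cycles) or ("stall", cycles) steps; while one thread waits on
# memory, another thread's ready work keeps the execution units busy.

def run(threads):
    busy, cycle = 0, 0
    remaining = [list(t) for t in threads]
    stalled_until = [0] * len(threads)
    while any(remaining):
        issued = False
        for i, steps in enumerate(remaining):
            if steps and stalled_until[i] <= cycle:
                kind, n = steps.pop(0)
                if kind == "work":
                    busy += n
                    cycle += n
                else:                      # issue the load, move on
                    stalled_until[i] = cycle + n
                issued = True
                break
        if not issued:
            cycle += 1                     # everyone stalled: core sits idle
    return busy, cycle

job = [("work", 2), ("stall", 4), ("work", 2)]
print(run([job]))       # (4, 8): one thread, 50% busy
print(run([job, job]))  # (8, 10): two threads, 80% busy on the same core
```

The second thread adds useful work almost for free, so long as its ready work overlaps the other thread's stalls rather than contending for the same unit at the same time.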

32:03.480 --> 32:07.120
That's an important nuance here.

32:07.120 --> 32:10.840
Hyper-threading, simultaneous multithreading, and so on.

32:10.840 --> 32:14.320
They all came out of the need of cloud providers, broadly speaking, who are running a lot

32:14.320 --> 32:15.920
of different types of codes.

32:15.920 --> 32:20.520
Web servers, sometimes it's mostly web servers and different applications, databases,

32:20.520 --> 32:21.520
and so on.

32:21.520 --> 32:24.920
And what that meant is that all these different codes, all these different virtual machines,

32:24.920 --> 32:29.440
the operating system threads, and so on, were all adding contention to different portions

32:29.440 --> 32:30.680
of the core.

32:30.680 --> 32:32.680
They weren't always hammering the exact same unit.

32:32.680 --> 32:36.440
Because if you are hammering the exact same unit, I can do a little bit of extra work,

32:36.440 --> 32:38.640
while waiting for memory to come and go.

32:38.640 --> 32:42.480
But if I'm running a lot of different threads, doing a lot of different things, it's very

32:42.480 --> 32:47.840
easy for me to add an extra thread that handles all of this nonsense for user zero, and

32:47.840 --> 32:49.760
all of this nonsense for user one.

32:49.760 --> 32:54.320
And I can have these two programs effectively going back and forth on the resources, making

32:54.320 --> 32:56.040
sure that you have higher occupancy.

32:56.040 --> 33:11.680
And then it becomes a game of well, I can do more with my core.

33:11.680 --> 33:15.080
When would I personally choose a data flow architecture?

33:15.080 --> 33:18.680
The camera's running.

33:18.680 --> 33:21.360
When I wake up in the morning and I hate myself.

33:22.280 --> 33:27.520
No, I kid, there are very reasonable data flow architectures that exist.

33:27.520 --> 33:34.320
The premise of a data flow architecture is typically what's known as a CGRA, a coarse-grained

33:34.320 --> 33:36.200
reconfigurable array.

33:36.200 --> 33:41.160
So we had the idea in FPGAs of, mostly, I'm looking at where the data is going to be.

33:41.160 --> 33:46.000
And then I go through and I have small DSPs, because it's field programmable.

33:46.000 --> 33:50.040
In this case, we're looking at a coarse-grained version of that, so I'm going to have much bigger

33:50.040 --> 33:51.040
DSPs.

33:51.040 --> 33:55.920
And more than anything, I'm looking at how can I flow the different portions of the data

33:55.920 --> 33:58.200
into being reaccumulated.

33:58.200 --> 34:01.760
But notice now, I'm in a weird middle ground.

34:01.760 --> 34:05.720
I'm saying there is a contract, but you don't have to follow it, which also means that

34:05.720 --> 34:11.360
as the programmer and as the user, you don't necessarily know where in the bounds of reality

34:11.360 --> 34:12.760
you can operate.

34:12.760 --> 34:16.400
You don't know exactly where things are going to be at what time and how quickly certain

34:16.400 --> 34:17.760
portions are.

34:17.760 --> 34:21.400
The other thing is, frankly, from a silicon and hardware standpoint.

34:21.400 --> 34:27.800
Now I'm looking at, I don't need as many redundant wires as I do in an FPGA, because

34:27.800 --> 34:30.480
an FPGA is like, go and do whatever you want.

34:30.480 --> 34:34.760
A CGRA says, I need to double triple quadruple, maybe 10x the amount of wires I would need

34:34.760 --> 34:38.520
for an ASIC design, but I'm only ever going to be using one of those wires, one of

34:38.520 --> 34:39.920
those connections.

34:39.920 --> 34:44.680
So now I've got a lot of extra redundant hardware that I can't use and I can't make

34:44.680 --> 34:45.680
use of.

34:45.680 --> 34:49.080
Those connections aren't being used, in the same way that on your motherboard, if you're not using

34:49.080 --> 34:53.200
all of your Ethernet ports and you're leaving them unpopulated, or you don't populate

34:53.200 --> 34:54.560
all the DIMMs in your server.

34:54.560 --> 34:57.120
You're missing out on bandwidth and you can't do anything with that, but you still paid

34:57.120 --> 34:58.120
for it.

34:58.120 --> 34:59.120
Right?

34:59.120 --> 35:00.120
There's no value there.

35:00.120 --> 35:01.520
Now, at the same point, there's an argument.

35:01.520 --> 35:05.480
NextSilicon is a current architecture in the space that's doing something really interesting

35:05.480 --> 35:10.360
with a CGRA, where they're saying you'll have forward programmability.

35:10.360 --> 35:13.200
The data must always flow in the same direction through the CGRA.

35:13.360 --> 35:16.760
It's an "I can't believe it's not a CGRA."

35:16.760 --> 35:20.400
And that allows me to know at least the directionality when I'm doing my designs.

35:20.400 --> 35:22.240
So I don't have to do copies back and forth.

35:22.240 --> 35:26.280
My designs can be forward-feeding instead of backward-feeding, except for like a backup connection,

35:26.280 --> 35:28.320
and I can achieve higher utilization that way.

35:28.320 --> 35:32.320
But I'm not solving that problem, and it becomes horses for courses.

35:32.320 --> 35:33.920
There is no correct accelerator.

35:33.920 --> 35:35.720
There is no correct design.

35:35.720 --> 35:40.920
There are use cases and workloads, and mappings of those workloads to hardware, and if I can get

35:40.920 --> 35:44.600
that hardware and make my users happy, then use that.

35:44.600 --> 35:45.600
Right?

35:45.600 --> 35:47.960
Now, GPUs are at a point where they're finally programmable for general purpose users.

35:47.960 --> 35:53.280
For the longest time, like, how many people here remember that to do physics, you must draw a triangle?

35:53.280 --> 35:54.280
Right?

35:54.280 --> 35:56.280
That's how we did math on GPUs for the longest time.

35:56.280 --> 35:59.440
That's thankfully no longer the case.

35:59.440 --> 36:00.760
And now we've fixed those interfaces.

36:00.760 --> 36:05.760
And we express that contract and we refine that contract such that users could make productive

36:05.760 --> 36:07.760
use and get on with solving their actual problems.

36:35.760 --> 36:41.800
Is that the way that we want to move forward, to basically have accelerators for every single

36:41.800 --> 36:43.800
thing that our programs or the company use?

36:43.800 --> 36:49.040
[inaudible]

36:49.040 --> 36:53.040
Yeah, that's going to be a long one to repeat.

36:53.040 --> 37:00.480
So the question is if everything is an accelerator, how do I present a cohesive front

37:00.480 --> 37:05.440
to my users so that they can make productive use of different types of accelerators?

37:06.440 --> 37:07.440
Right.

37:07.440 --> 37:12.440
My answer to that is you define the contract.

37:12.440 --> 37:18.200
And I mean that very seriously, like I know it's a broken record at this point, but you have

37:18.200 --> 37:20.640
to define the contract.

37:20.640 --> 37:26.560
If I tell you that you want to use these, I need to present them well. I'm going to go off script.

37:26.560 --> 37:32.040
One of the things that's important here is programming language design, for example.

37:32.040 --> 37:36.600
If you want users to be able to use something, you need to inform them on how they need

37:36.600 --> 37:38.880
to approach that usage.

37:38.880 --> 37:43.960
It's not very useful for Ikea to send me flat pack furniture if I have no concept of

37:43.960 --> 37:45.760
a screwdriver.

37:45.760 --> 37:49.400
But as a person that lives in this world, I know the concept of a screwdriver.

37:49.400 --> 37:53.400
And I have instructions that say this is how you use it to put it together to do something

37:53.400 --> 37:54.400
useful.

37:54.400 --> 37:59.720
However, if I were to just get a bunch of two-by-fours and I said, I want someone to build

37:59.720 --> 38:03.240
me a boat that's not exactly useful.

38:03.240 --> 38:08.560
So then the question becomes, okay, I have different tiers of users and different types of

38:08.560 --> 38:09.560
users.

38:09.560 --> 38:12.720
And how do I want to express that complexity?

38:12.720 --> 38:17.600
Because this is a talk I used to give called meeting users where they're at.

38:17.600 --> 38:20.120
Every type of user has a different goal.

38:20.120 --> 38:22.040
I'm an assembly and hardware person.

38:22.040 --> 38:23.320
I live at that interface.

38:23.320 --> 38:26.800
That's where I'm interested, which means I want to define really clean interfaces that

38:26.800 --> 38:28.280
are very ergonomic.

38:28.280 --> 38:33.080
I want to make sure that the first-year master's student who's never even heard of

38:33.080 --> 38:38.280
MPI, let alone OpenMP and so on, can make productive use of hardware.

38:38.280 --> 38:42.520
At the same time, I need to go and talk to my colleagues and say, well, how wide does that

38:42.520 --> 38:45.280
vector load store really need to be?

38:45.280 --> 38:47.240
And these are different worlds.

38:47.240 --> 38:52.600
So when you're defining the interface, the contract between levels becomes very important.

38:52.600 --> 38:53.840
How do I express it?

38:53.840 --> 38:57.680
And the unfortunate reality is that those contracts break.

38:57.680 --> 39:04.400
Contract here is analogous to a, to a lot of things, to an abstraction.

39:04.400 --> 39:09.680
Because what the abstraction tells you is, I'm going to deal with this for you.

39:09.680 --> 39:16.360
So long as you give me well-formed data. But well-formed does not mean magic; it means you

39:16.360 --> 39:19.240
are guaranteed to get the answer at the end.

39:19.240 --> 39:21.720
I will implement exactly what you give me.

39:21.720 --> 39:25.840
It does not mean I will give you the perfect optimal implementation no matter what you

39:25.840 --> 39:28.920
give me and you don't have to think anymore.

39:28.920 --> 39:33.520
It's also why things like, oh, I'm going to jump from Python directly to the

39:33.520 --> 39:38.720
FPGA on a device and get infinite performance, they're not going to work on current

39:38.720 --> 39:39.720
problems.

39:39.720 --> 39:44.920
You'll see high-level languages, high-level synthesis, especially where they say any grad student

39:44.920 --> 39:51.680
can give me MATLAB and I will go and program a 10,000-FPGA cluster and I will get 99.99%

39:51.680 --> 39:52.680
performance.

39:52.680 --> 39:54.880
No, that's just not how it can happen.

39:54.880 --> 39:59.200
Like, oh, power to you, go try, let me know when you're done.

39:59.200 --> 40:03.280
So that being the case when we're defining those interfaces, keeping a contract between levels

40:03.280 --> 40:08.280
of how I need to think about and react to it, is how I think about it.

40:08.280 --> 40:11.480
Because yes, the hardware is always heterogeneous.

40:11.480 --> 40:16.560
Some portions are always going to be accelerated, but you don't need to know that.

40:16.560 --> 40:19.000
You need to know I'm going to get the correct answer and if I give it to you in this

40:19.000 --> 40:25.120
particular way, in this particular consistent approach, I will get more out of the hardware.

40:25.120 --> 40:31.640
And then you talk to, well, I'm in assembly now, I'm in a compiler IR, I'm in MLIR, I'm in a

40:31.640 --> 40:35.120
language above that, at different levels of semantics and I can express different things.

40:35.120 --> 40:38.000
It's also why we have different levels of programming languages.

40:38.000 --> 40:43.160
We have your Cs, your Fortrans for people with taste.

40:43.160 --> 40:46.760
And then we go above that and implement different ideas and different concepts, so I'm

40:47.000 --> 40:49.000
getting some side-eye.

40:49.000 --> 40:53.000
We should wrap up here, so we don't get in trouble with the closing-session people who want the room.

40:53.000 --> 40:54.000
All right.

40:54.000 --> 40:56.560
But, actually, it doesn't end here.

40:56.560 --> 41:00.920
You can stick around at hpc.social, it's a website, there are weekly huddles, like

41:00.920 --> 41:05.960
every week, Felix is often there, so you can go hear him.

41:05.960 --> 41:06.960
Brandt.


41:08.960 --> 41:13.200
I think you can connect online as well, not only with Felix, but with everybody.

41:14.120 --> 41:18.880
Yeah, so actually, just before we close, one of the great things about hpc.social:

41:18.880 --> 41:23.440
We have everything from the people that design the supercomputers for Azure to

41:23.440 --> 41:24.600
first-year undergrads.

41:24.600 --> 41:26.840
It's an open forum for anyone and everyone.

41:26.840 --> 41:30.800
The Spack people are there, the EasyBuild people are there, LLVM and GCC people, people that

41:30.800 --> 41:35.080
work on the standards committees for how we actually implement languages, how we implement

41:35.080 --> 41:36.080
MPI.

41:36.080 --> 41:37.640
What are the interfaces between MPI implementations?

41:37.640 --> 41:40.640
Like, what's the ABI standard that, well, what do you do?

41:41.300 --> 41:42.260
Yeah.

41:42.280 --> 41:44.600
Yeah.

41:44.600 --> 41:49.260
Thank you.

41:49.260 --> 41:53.980
[inaudible]

41:53.980 --> 41:55.860
[inaudible]

