WEBVTT

00:00.000 --> 00:13.960
So, for the next talk, I have the pleasure to introduce Ramon, who will be talking about closing the loop,

00:13.960 --> 00:16.360
and self-learning compilers.

00:16.360 --> 00:24.680
I've heard today that NVIDIA published an AI-agent-generated AI generator

00:24.680 --> 00:29.960
to generate models, which is six times slower than the one written by people.

00:29.960 --> 00:35.640
So, let's see if we can do a much better job with self-learning compilers for AI.

00:35.640 --> 00:37.640
Let's see.

00:37.640 --> 00:38.640
Hi.

00:38.640 --> 00:39.640
Hi.

00:39.640 --> 00:41.840
So, my name is Ramon Rush.

00:41.840 --> 00:46.520
I represent Daisy Tuner, where I work as a compiler engineer, and I'll be presenting,

00:46.520 --> 00:49.520
Closing the Loop: A Self-Learning Compiler for AI Accelerators.

00:49.520 --> 00:50.520
Okay.

00:50.520 --> 00:54.920
So, I think it's pretty safe to say we live in the age of hardware accelerators.

00:54.920 --> 01:01.000
Hardly a month goes by without news about a new accelerator or even a new architecture.

01:01.000 --> 01:03.440
So, what does that mean for the user?

01:03.440 --> 01:08.960
First, we have to choose what kind of accelerator we want to target, if we determine

01:08.960 --> 01:10.880
that we need acceleration.

01:10.880 --> 01:16.240
And then there's always the balance between portability and specialization.

01:16.240 --> 01:21.040
You can remain agnostic to the actual hardware accelerator, easily swap them out if you

01:21.040 --> 01:25.360
remain on the right side, but you'll probably sacrifice a lot of performance, and you

01:25.360 --> 01:31.040
can get the best performance on the left side, but you're probably looking at vendor lock-in or

01:31.040 --> 01:33.040
API lock-in.

01:33.040 --> 01:40.600
So, let's look at an example stack: a user that comes with a portable system, who hasn't

01:40.600 --> 01:48.360
used an accelerator before, but the program proves that it works on complex problems.

01:48.360 --> 01:52.000
And here we can still see the balance between generality and specialization.

01:52.000 --> 01:55.600
If you want the best performance, you would write directly for the instruction set architecture

01:55.600 --> 02:01.000
of the accelerator, but the heart of this problem is the mapping.

02:01.000 --> 02:05.240
Here we can get all the performance or lose all the performance, but I'll get back to that

02:05.240 --> 02:06.240
later.

02:06.240 --> 02:08.920
First off, let's look at what we want to do.

02:08.920 --> 02:15.280
We want to build a toolbox that allows for optimized mapping onto all sorts of different

02:15.280 --> 02:20.960
accelerators from basically any kind of input language or format.

02:20.960 --> 02:21.960
Yeah.

02:21.960 --> 02:24.600
So, let's look at how we do that.

02:24.600 --> 02:30.480
First off, the representation in which we think, so you get a better idea of the level at which

02:30.480 --> 02:32.520
we're tackling that mapping.

02:32.520 --> 02:38.440
So we call our representation structured stateful dataflow multigraphs, or SDFGs.

02:38.440 --> 02:41.080
Let's start out with a simple data flow graph.

02:41.080 --> 02:45.800
We have access nodes, which represent variables in this case.

02:45.800 --> 02:50.080
We have tasklets, which are the basic math operations, and we have the edges, which we call

02:50.080 --> 02:51.080
memlets.

02:51.080 --> 02:52.080
This would be a load.

02:52.080 --> 02:53.880
Here would be a store operation.

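NOTE
A purely illustrative way to write down the tiny graph just described, with access nodes, one tasklet, and load/store memlets. This dictionary encoding is a didactic stand-in in Python, not the library's actual data structure or API.
import json
# Didactic stand-in for the small dataflow graph above (not the real format):
# access nodes for the variables, one tasklet for the math, and memlet edges
# for the loads and the store.
graph = {
    "access_nodes": ["a", "b", "c"],
    "tasklets": [{"id": "t0", "op": "c = a * b"}],
    "memlets": [
        {"src": "a", "dst": "t0", "kind": "load"},
        {"src": "b", "dst": "t0", "kind": "load"},
        {"src": "t0", "dst": "c", "kind": "store"},
    ],
}
print(json.dumps(graph, indent=2))  # JSON serialization, as mentioned later in the talk
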
02:53.880 --> 02:55.920
We can make this more complex.

02:55.920 --> 03:00.560
We can add dependent data flow; it's a very simple, very easy to understand format.

03:00.560 --> 03:04.080
We can add parallel execution in that graph.

03:04.080 --> 03:08.720
We can add offsets and operate on arrays instead of scalar variables.

03:08.720 --> 03:10.280
Still looks pretty simple.

03:10.280 --> 03:13.000
Then we can wrap this in a loop.

03:13.000 --> 03:16.360
In this case, it would be a data parallel loop.

03:16.360 --> 03:20.960
And the memlets are now indirect accesses, and we can take this further, wrap it into

03:20.960 --> 03:21.960
more loops.

03:21.960 --> 03:24.440
Then we have the naive matmul implementation.

03:24.440 --> 03:30.680
Here the memlets are multidimensional accesses, technically with types attached.

03:30.680 --> 03:37.200
So we can take this one abstraction further and just represent a call.

03:37.200 --> 03:42.440
In this case, our memlet edges become just address calculations, no memory access, and

03:42.440 --> 03:47.680
the call could be anything: to other graphs, to external functions in some kind of environment,

03:47.680 --> 03:52.800
or intrinsics, which are just well-known functions, like GEMM, that we could replace with

03:52.800 --> 03:57.640
the underlying code, or just call something else.

03:57.640 --> 04:00.480
Then lastly, we have conditional execution.

04:00.480 --> 04:05.520
It's also pretty simple, and you can chain all of these elements for larger programs.

04:05.520 --> 04:11.400
To summarize, the format expresses data flow and control flow at the same time.

04:11.400 --> 04:12.640
It's structured.

04:12.640 --> 04:17.000
You can see the loops directly, so that makes it very convenient if you're talking about

04:17.000 --> 04:20.240
optimizing, transforming, changing loops.

04:20.240 --> 04:26.280
It's a lot more coarse than other formats like this, and it's simpler and easier to analyze.

04:26.280 --> 04:28.320
What it doesn't allow is irregular control flow.

04:28.320 --> 04:32.920
So you cannot represent a program that just contains arbitrary branching.

04:32.920 --> 04:37.960
You need to simplify that, but we don't think it's a large problem, because most code

04:37.960 --> 04:43.480
is structured and doesn't have that problem.

04:43.480 --> 04:47.800
Also, you might have noticed the address and condition calculations are in a different

04:47.800 --> 04:51.480
format than the data flow, especially for parallel problems.

04:51.480 --> 04:57.480
This is very convenient, and the more separate they are, the more parallel the problem

04:57.480 --> 05:00.240
tends to be.

05:00.280 --> 05:06.160
Okay, so what are the elements that we have in our toolbox so far?

05:06.160 --> 05:12.360
So we have our SDFGs, APIs to build them, and we have serialization; so far it's just JSON,

05:12.360 --> 05:14.320
more to come.

05:14.320 --> 05:21.440
Then we have analysis passes similar to compilers that operate on loops, data dependencies

05:21.440 --> 05:28.080
and parallelism, and we have optimization passes that can optimize the loops, transform

05:28.080 --> 05:35.200
them to parallel execution, and so on, and right now our output is C/C++ code generation.

05:35.200 --> 05:40.160
So this way we are currently very agnostic of what comes behind it; any C/C++ compiler

05:40.160 --> 05:46.920
could compile it to an architecture, just like what I showed in the stack graph.

05:46.920 --> 05:52.400
In terms of front ends, right now part of our library are Python bindings, so you can build

05:52.400 --> 05:58.000
those graphs from Python, and we also have a decorator, so let's look at an example.

05:58.000 --> 06:04.320
If we have this simple code, we can annotate it with our annotation, and when it executes,

06:04.320 --> 06:09.760
our library will parse that code, transform it into our graph representation, and if you let

06:09.760 --> 06:17.080
it do its default work, it will output C/C++ code, compile it, load it into the runtime again,

06:17.080 --> 06:22.800
and execute it, so it will just make simple Python code faster, so far, but it's a good

06:22.800 --> 06:26.480
platform to experiment and look at what the representation does.

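NOTE
A minimal sketch of the decorator workflow just described. The module and decorator names (daisytuner, @daisytuner.program) are assumptions for illustration, not the library's confirmed API, so the decorator stays commented out and the function runs as plain NumPy.
import numpy as np
# import daisytuner                  # hypothetical import, see note above
# @daisytuner.program                # would parse -> graph -> C/C++ -> compile -> load -> run
def axpy(alpha, x, y):
    return alpha * x + y             # plain Python/NumPy, readable as a small dataflow graph
x = np.random.rand(1024)
y = np.random.rand(1024)
print(axpy(2.0, x, y)[:4])           # with the decorator, this call would run the compiled code
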
06:26.560 --> 06:36.560
So this would, for example, be our graph that then executes. Okay, we are also working on,

06:36.560 --> 06:43.600
I believe it's been pushed today into the git, it's not really doing much so far, but

06:43.600 --> 06:50.720
more to come: a front end to ingest PyTorch models directly without having to go through Python

06:50.800 --> 06:59.200
code, and also working on ingesting MLIR graphs directly. On the target side, we have the generic

06:59.200 --> 07:04.160
host target, which basically uses the normal C/C++ code generation, so basically if you want

07:04.160 --> 07:10.640
to execute it on the CPU. We have an OpenMP back end, which basically adds OpenMP annotations

07:10.640 --> 07:17.120
to the output code, so it runs on multiple cores. We use Google Highway for vectorization, so we can

07:17.200 --> 07:28.720
use that, and we also have a CUDA back end, so let's take a look at that. Okay, so for CUDA: so far we had

07:28.720 --> 07:35.760
normal linear code that runs on the CPU. Now we want to offload that, so we need to cut out a portion

07:36.560 --> 07:41.760
and we need to generate code that runs on the device, we need to generate the host code,

07:41.760 --> 07:48.160
and we need to generate memory management and transfers to the device and back, as a graph,

07:48.160 --> 07:53.600
this would look something like this: in the middle, the same matmul graph from before, and just the

07:53.600 --> 08:01.200
memory transfers above and below it, and what we also added on top is optimizations around the

08:01.200 --> 08:07.280
data structures, so if you were to execute this twice, the second one on the results of the first,

08:07.360 --> 08:11.920
it would be a waste to transfer that data back to the host and back to the accelerator again,

08:11.920 --> 08:16.160
and we can just remove those transfers; similar with read-only data that is never changed,

08:16.160 --> 08:22.800
it can just remain on the device, and this optimization is also generic, it could apply to any

08:22.800 --> 08:30.240
accelerator that is modeled as a target for this library.

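NOTE
A small sketch of the round-trip elimination idea described above, on a made-up list of transfer and kernel operations. It only illustrates the concept, not the library's actual pass.
ops = [
    ("copy_to_device", "A"), ("kernel_1", "A->B"), ("copy_to_host", "B"),
    ("copy_to_device", "B"), ("kernel_2", "B->C"), ("copy_to_host", "C"),
]
def drop_round_trips(schedule):
    out, i = [], 0
    while i < len(schedule):
        # a copy-out immediately followed by a copy-in of the same buffer is redundant
        if (i + 1 < len(schedule)
                and schedule[i][0] == "copy_to_host"
                and schedule[i + 1] == ("copy_to_device", schedule[i][1])):
            i += 2  # buffer B simply stays resident on the device
        else:
            out.append(schedule[i])
            i += 1
    return out
print(drop_round_trips(ops))  # kernel_2 now reads B directly on the device
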
08:30.240 --> 08:38.000
Okay, so let's look at how well this works so far. This is NPBench, a popular Python benchmark; it uses a few different back ends, also with decorators,

08:38.000 --> 08:44.720
just as we do, so they just-in-time replace the code that executes and measure the performance,

08:45.440 --> 08:52.240
and NumPy is the baseline on the bottom in milliseconds and everything else is relative to

08:52.800 --> 08:59.120
NumPy; green indicates a speedup, red indicates a slowdown. So these are just the frameworks that come

08:59.200 --> 09:08.880
with NPBench, and we run it on the paper size. Okay, so this is our baseline; we don't do any crazy

09:08.880 --> 09:14.000
optimizations yet, it's basically just cleaning up what we get from Python, and the C code we compile

09:14.000 --> 09:22.000
with Clang at O3, and you can see we are achieving some speedups. There are a few gaps where

09:22.000 --> 09:28.480
we don't map all operations just yet, but we're actually pretty happy with how well that works.

09:28.640 --> 09:35.040
If we enable OpenMP, it still works; right now we're not seeing that much speedup because

09:35.040 --> 09:41.120
it's annotation based, so if it's just a library call that's emitted, the annotation won't actually

09:41.120 --> 09:49.360
do anything to parallelize it. And the CUDA back end works and can offload code that was not optimized

09:49.360 --> 09:57.040
for any GPU or anything, and we are also pretty happy that we're beating, in some cases, the CuPy implementation

09:57.120 --> 10:05.120
already. So, but as I said in the intro, mapping is the heart of the problem, so how do we map

10:05.120 --> 10:11.680
something such as this to various accelerators? For example, here a regular CUDA architecture,

10:12.240 --> 10:18.320
which is actually pretty similar to a normal CPU in terms of memory topology and address space,

10:18.320 --> 10:23.760
it's one shared address space, any processor can access any data, it just determines the speed in

10:23.760 --> 10:29.920
which it does it. But we can also go crazy and, for example, look at the Tenstorrent architecture,

10:29.920 --> 10:35.760
which is just a mesh of many connected cores that only talk to their neighbors, so here if we

10:35.760 --> 10:40.160
wanted to get access to some data, we need to move that data through all those cores,

10:40.160 --> 10:45.680
we need to consider all the bottlenecks we created on the way, on each core, and even if we get the

10:45.680 --> 10:50.720
data to its target, it's still not a normal core, it's still five cores that need to move

10:50.880 --> 10:57.040
and stream data around between the various accelerators. So, the mapping is, at its heart,

10:57.040 --> 11:02.160
about parallelism on the different levels, CPU cores, threads, vectors; it's about the memory

11:02.160 --> 11:08.240
topology and locality, and it's about data movement, and you can also go further into data format

11:08.240 --> 11:14.000
and placement, as we saw before. And if you do all of this right, you get a well-performing kernel.

11:15.120 --> 11:20.560
Right now, the standard way to do this is to basically build a library of hard-coded or hand-optimized

11:20.560 --> 11:25.920
kernels by experts for a given accelerator. For very popular math functions, you can just build a

11:25.920 --> 11:33.200
library for a few of those and have good performance. The hard-coded, agnostic way, sorry,

11:33.200 --> 11:39.120
I turned those around. Yeah, there are already solutions to do this in an agnostic way

11:39.120 --> 11:44.720
that automatically map this, but they don't get close in performance to the manual performance

11:44.960 --> 11:51.200
engineering. Okay, so to explain manual performance engineering, let's look at the short example

11:51.200 --> 11:57.200
again, the simple matmul application, and this would be the NVIDIA example of how to run

11:57.200 --> 12:04.560
that fast in CUDA, and just to show you what goes into it. So, here's code to initialize the hardware

12:04.560 --> 12:08.960
and allocate memory. We have data transfers to the accelerator because it's a separate memory.

12:09.920 --> 12:17.120
We have to match the exact hardware size we have. Here, this hides the two outer loops because

12:17.120 --> 12:22.320
the hardware manages those. Then this is where the code that runs on the accelerator starts.

12:23.680 --> 12:30.080
We allocate new memory and prefetch data into that memory before we calculate on it. We prefetch

12:30.080 --> 12:34.320
in a different order to maximally utilize cache lines, so we fetch all the data linearly.

12:35.280 --> 12:39.920
We still accumulate in the existing order. We need to manage multiple threads that we didn't

12:39.920 --> 12:47.360
have before, and we're saving ourselves a write to memory. So a lot of stuff, but we can also break

12:47.360 --> 12:54.960
this down as a collection of building blocks. Most of those optimizations are like textbook optimizations.

12:57.040 --> 13:03.680
They have parameters, and you have to figure out what to apply them to and when, but it's basically building

13:04.320 --> 13:08.880
blocks. So if you are enough of an expert in the hardware, you can pick the standard textbook

13:08.880 --> 13:14.320
transformations, pick their parameters, pick which parts of the algorithm to apply them to,

13:15.120 --> 13:20.400
and then the order of it, and then instead of looking at the complex kernel I showed on the right

13:20.400 --> 13:26.960
side, you can look at the basic pseudo code on the left side and a recipe that is specialized

13:26.960 --> 13:32.080
for the algorithm and the specific hardware. It'll get you the same result, but this way you can look

13:32.640 --> 13:37.280
at the base algorithm and the recipe is what determines what hardware you can run it on.

13:37.280 --> 13:40.400
And as long as you have the recipe, it will run anywhere as fast as it can.

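NOTE
One way to picture such a recipe as data: an ordered list of textbook transformations, each with parameters and the loop it applies to. The transformation names and fields are illustrative assumptions, not the library's schema.
recipe = [
    {"transform": "tile",        "loop": "i", "params": {"size": 32}},
    {"transform": "tile",        "loop": "j", "params": {"size": 32}},
    {"transform": "interchange", "loops": ["j", "i_inner"]},
    {"transform": "vectorize",   "loop": "k", "params": {"width": 8}},
]
for step in recipe:          # applying the recipe means running these in order
    print(step)
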
13:42.640 --> 13:50.800
An example for a recipe: on the left, the simple code, and the transformation we're applying is loop

13:50.800 --> 13:55.920
tiling. So we're splitting one loop up into two. The outer loop will make bigger jumps, blocks of

13:55.920 --> 14:00.000
32 in this case, and the inner loop will iterate over the elements sequentially as before.

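NOTE
A plain-Python sketch of that tiling step: one loop split into an outer loop that jumps in blocks of 32 and an inner loop that walks each block sequentially.
N, TILE = 100, 32
total = 0
for ii in range(0, N, TILE):                  # outer loop: blocks of 32
    for i in range(ii, min(ii + TILE, N)):    # inner loop: sequential within a block
        total += i                            # same work as the original single loop
assert total == sum(range(N))                 # tiling does not change the result
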
14:00.800 --> 14:08.320
And we can do this to the outer loop as well. Then for example, we could run loop interchange,

14:08.320 --> 14:13.040
we just swap two loops. There's no code between them, so we can do that without affecting

14:13.040 --> 14:21.440
the innermost code, and we get this. And why do we do this? Okay, now here we can see the green loops

14:21.440 --> 14:26.800
basically just operate on the rectangle on the right side. And the outer loops work over the

14:26.800 --> 14:32.560
rest of the data. So what that gets us is very localized data, which for example could fit into

14:32.560 --> 14:40.000
a cache of one processor core. So basically this is what we would run on the multiple cores,

14:40.000 --> 14:46.560
and the inner code is what would run on one of those cores, to optimally parallelize something like this.

14:46.560 --> 14:51.680
And we can repeat the same concept on the inner loop to exploit vectorization, for example.

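NOTE
To make the locality argument concrete, here is a plain-Python matrix multiplication with both outer loops tiled and interchanged as described: the two innermost loops only touch one small block, which is the part that could sit in one core's cache or be vectorized, while the blocked outer loops could be spread over cores. Sizes are arbitrary.
import numpy as np
N, T = 64, 32
A, B = np.random.rand(N, N), np.random.rand(N, N)
C = np.zeros((N, N))
for ii in range(0, N, T):              # blocked (outer) loops: candidates for cores
    for jj in range(0, N, T):
        for i in range(ii, ii + T):    # localized (inner) loops: one T-by-T block of C
            for j in range(jj, jj + T):
                C[i, j] = A[i, :] @ B[:, j]
assert np.allclose(C, A @ B)           # the transformed loop nest computes the same result
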
14:51.680 --> 14:59.440
But where do we get the fast recipe? We talked about how we can apply it, but not where we

14:59.440 --> 15:08.400
actually get it. We have the building blocks. Okay, there is a standard solution for this,

15:08.400 --> 15:13.440
it's called auto-tuning. But what auto-tuning is, is basically just trying all the possibilities,

15:14.080 --> 15:18.160
spending a lot of time on it, and eventually you find the best-performing solution for whatever

15:18.240 --> 15:24.400
performance target you had. This takes a long time; it's infeasible to do this every time you want to compile something or build an application.

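NOTE
A tiny brute-force auto-tuning sketch in the spirit of what's described: try every candidate parameter, time it, keep the fastest. The candidate values and the toy kernel are made up.
import time
import numpy as np
def blocked_sum(x, block):                   # toy kernel with one tunable parameter
    return sum(x[i:i + block].sum() for i in range(0, len(x), block))
x = np.random.rand(1_000_000)
results = {}
for block in (64, 256, 1024, 4096, 16384):   # exhaustively try all candidates
    t0 = time.perf_counter()
    blocked_sum(x, block)
    results[block] = time.perf_counter() - t0
best = min(results, key=results.get)
print("best block size:", best)              # this is what you would cache as part of a recipe
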
15:24.400 --> 15:29.760
So what we really want is to cache those results, the recipes, on a remote server.

15:29.760 --> 15:35.680
So once we have it,

15:35.680 --> 15:40.720
everybody can use the recipe without having to know anything about the accelerator, and as long as

15:40.720 --> 15:44.800
the recipe exists, you can switch the accelerator and gain the best performance for that hardware,

15:44.800 --> 15:52.960
as well. Okay, so, but there are many accelerators, many different architectures, and even for

15:52.960 --> 15:58.640
the same architecture, the optimizations can change, because there are microarchitectural differences

15:58.640 --> 16:06.240
like cache sizes, which change how to get the best performance out of it. Okay, so what we want

16:06.240 --> 16:11.840
on top of the cache is similarity-based search. We don't just want to find the exact match,

16:11.920 --> 16:17.360
we want to find any algorithm that was similar on similar hardware and get the recipe for this

16:18.160 --> 16:23.920
because it will get you similar performance on code we haven't seen before. This would look

16:23.920 --> 16:29.680
something like this, a database of similar algorithms, and we can port over the

16:29.680 --> 16:37.520
optimization recipes from a slightly different code. Okay, so we are using a neural network to

16:37.520 --> 16:44.320
classify the code, so to get the coordinates in that diagram, and we store for each point in

16:44.320 --> 16:49.360
the diagram, the optimization recipe, the results we observed, and we can do that with multiple

16:50.080 --> 16:54.880
target architectures, and then you basically need a server farm to do auto tuning nonstop

16:54.880 --> 16:58.720
for every architecture that you want to support to find the recipes for every code you know of.

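NOTE
A minimal sketch of the similarity-based lookup: each known kernel is stored with an embedding (here a made-up 2-D vector standing in for the neural-network classification) and its recipe, and a new kernel reuses the recipe of its nearest neighbour. All names and values are illustrative.
import numpy as np
recipe_db = {
    "matmul_like":  (np.array([0.9, 0.1]), ["tile(32)", "interchange", "vectorize"]),
    "stencil_like": (np.array([0.2, 0.8]), ["tile(64)", "parallelize_outer"]),
}
def nearest_recipe(embedding):
    # pick the stored kernel whose embedding is closest to the query point
    name, (_, recipe) = min(recipe_db.items(),
                            key=lambda kv: np.linalg.norm(kv[1][0] - embedding))
    return name, recipe
print(nearest_recipe(np.array([0.85, 0.2])))  # -> reuses the matmul-like recipe
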
16:59.840 --> 17:05.600
And this is self-learning: if you feed it new code, it can figure out how to optimize that;

17:05.680 --> 17:10.560
if you give it a new target architecture and accelerator, it can learn how to do that as well.

17:12.560 --> 17:18.480
To give you a preliminary showcase of how well that is working out, we're using Polybench;

17:18.480 --> 17:24.880
this is running our internal C++ LLVM front end, so we're running on C code instead of the

17:24.880 --> 17:32.240
Python I showed before. It shows the speedup over code already vectorized by us if we apply our

17:32.320 --> 17:37.440
transfer tuning to it, and the transfer tuning database we use is currently very limited because

17:37.440 --> 17:42.880
it's very fresh. Okay, same type of graph, so you can see there are some benchmarks that

17:42.880 --> 17:49.840
profit by as much as a factor of two from the transfer tuning, and locally it was not much work for the

17:49.840 --> 17:55.120
compiler to do this because somebody else already ran the auto tuning and figured out what to do,

17:56.080 --> 18:02.400
but there are some outliers where we make it slower by a lot. We hope to fix that soon.

18:03.840 --> 18:10.160
Okay, so in terms of building blocks, we're also adding the remote optimization, and our public

18:10.160 --> 18:16.640
repository includes a demo server that basically responds with preset responses of what to optimize for,

18:17.680 --> 18:21.680
because the server is not live yet in that way, but the client code is already there.

18:22.000 --> 18:28.960
And internally, as I mentioned, we are already working on doing that from C++ code directly

18:28.960 --> 18:37.120
and the transfer tuning database, but those are not ready yet. Yeah, and we hope to gain

18:37.120 --> 18:43.440
contributions for different accelerators if we don't get to them ourselves and new front ends,

18:43.440 --> 18:47.760
and if you're curious and want to give it a try, these are the repositories; the one above is the tool,

18:48.720 --> 18:53.920
and just yesterday we published the Python project that does the transformation that I showed in the

18:53.920 --> 18:57.920
beginning. Thank you for your time.

