WEBVTT

00:00.000 --> 00:12.000
Hello everyone, I think we can start this next talk, because it's something I'm

00:12.000 --> 00:19.040
really looking forward to seeing. The last word I used when I was talking earlier was

00:19.040 --> 00:26.000
"matmul", and that was the perfect segue for this. So Pete here is going to talk

00:26.000 --> 00:32.000
about how you get a card and how you actually go from zero to matmul. Thank you very much.

00:34.000 --> 00:42.480
Okay, mic check. One, two, good. Okay, so hopefully everyone in this room knows what zero is,

00:42.480 --> 00:46.800
and you know what matmuls are. And if you were in the room for the previous talk from

00:46.800 --> 00:52.000
John Necker, you'll know what the ET-SoC-1 is. For the benefit of anyone else: it's a chip.

00:52.000 --> 00:57.840
It's this chip here. You can find it on cute little PCIe boards like that one there.

00:59.040 --> 01:04.000
And all the fun stuff is inside the pink rectangle there. So you've got four DRAM chips

01:04.000 --> 01:10.560
around the edge, and then the main chip in the middle. So this is a block diagram trying

01:10.560 --> 01:16.400
to show you what's inside that region there. So each pink rectangle is a piece of storage

01:16.400 --> 01:21.760
and each gold square is a little RISC-V CPU core. And then the black network connects

01:21.840 --> 01:27.440
everything together. So if you were to count them all, you would find one thousand and ninety-three

01:27.440 --> 01:32.800
RISC-V CPU cores. Five of them are special, so I'm going to kind of ignore those.

01:33.440 --> 01:38.800
32 of them might get lost to silicon yield, so I'm going to round down and ignore those too. And then

01:38.800 --> 01:44.160
firmware takes a few. So I'm going to pretend it takes lots, and say it takes 32,

01:44.160 --> 01:51.520
leaving a nice round 1,024 cores to play with. So as this is a free-software conference, I want to

01:51.520 --> 01:56.240
call out the open source things here. So the silicon source is on GitHub, as the previous talk just

01:56.240 --> 02:02.720
told you. The firmware is on GitHub, the manuals are on GitHub, the software stack, again, is on GitHub.

02:03.280 --> 02:09.120
And the compiler isn't yet there, but I'm told it will be there soon. And that's the man to

02:09.120 --> 02:16.080
badger if you want to make that happen. So I've got a thousand cores. I run all of them at a fixed

02:16.080 --> 02:20.800
clock speed in the name of reproducibility, because I don't like running the same thing

02:20.800 --> 02:26.560
twice and not getting the same measurement. And the maths says that at that fixed clock, this hardware should

02:26.560 --> 02:32.560
give me just over 10 teraflops of FP32 compute. So that's going to be my goal for the rest of the

02:32.560 --> 02:37.920
talk. Now, hopefully some of you will have seen this meme previously, which is how to draw an

02:37.920 --> 02:42.640
owl: step one, you draw some circles, and step two, you draw the rest of the

02:42.720 --> 02:49.200
owl. The joke, of course, is that step two is rather more work than step one. Of course,

02:49.200 --> 02:53.760
you know, with AI, we could just ask a diffusion model to do step two, nice and simple.

02:54.960 --> 03:00.160
But we are here to implement AI, not just use AI. So our equivalent of drawing an

03:00.160 --> 03:06.480
owl is to recreate your tensor library of choice, like GGML or tinygrad or whatever else you want

03:06.480 --> 03:12.480
to use. And step one of that is making the matmuls work. And then step two is everything

03:12.720 --> 03:18.240
else. Again, I'm just here to draw some circles, as I've got a 20-minute slot. So,

03:18.240 --> 03:22.480
you know, I'm just drawing the circles, I'm not drawing the whole owl, there's tons more work there.

03:23.680 --> 03:30.480
So to stay within my time, I'm going to say that a matrix is always 512 by 512 at FP32 because

03:30.480 --> 03:36.000
anything else is drawing the rest of the owl. And this function here will take a matrix A and a

03:36.000 --> 03:41.520
matrix B, multiply them together, and put the result in matrix C. It is the worst code you will

03:41.600 --> 03:44.240
ever see for doing a matmul, but we've got to start somewhere.
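(A minimal sketch of that baseline in plain row-major C. The function and parameter names here are mine, not the talk's, and n is a parameter so the sketch stays runnable; the talk fixes n = 512.)

```c
#include <stddef.h>

// Naive triple-loop matmul: C = A * B, all matrices n x n, row-major FP32.
// This is the "worst code you will ever see" shape the talk starts from.
void matmul_naive(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```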

03:45.280 --> 03:50.720
Why is it the worst code? It runs atrociously: on my ET-SoC-1, this is seven

03:50.720 --> 03:55.440
megaflops per second. A back-of-the-envelope calculation says it could run far faster than that. This is

03:55.440 --> 04:01.600
terrible. But you know, I've got 17 minutes to go, I think. So we can try and make this go faster.

04:02.640 --> 04:06.720
So I'm going to do this over three main acts, and act number one is

04:06.880 --> 04:11.600
going parallel, because we are going to go quite parallel here. So if you've written any

04:11.600 --> 04:15.120
CUDA code previously, you will have seen threadIdx and blockIdx,

04:16.240 --> 04:20.480
because in the CUDA model you run thousands of copies of your code in parallel and each

04:20.480 --> 04:26.320
copy has its own distinct value of threadIdx slash blockIdx. And I'm using a similar-

04:26.320 --> 04:32.800
ish model here. So all the code that I'm showing you will get run 2048 times in parallel and each of

04:32.960 --> 04:41.040
those 2,048 times will have a distinct value of mhartid between 0 and 2,047. If you've been paying

04:41.040 --> 04:49.200
attention to the arithmetic: I said 1,024 cores, but there are 2,048 values of mhartid. To bridge that gap,

04:49.200 --> 04:54.480
the hardware is effectively doing two-way hyperthreading, with two harts per core.
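(The indexing model just described can be sketched in plain C: the same code runs once per hart, and each copy sees a distinct id. On the real chip the id comes from the RISC-V mhartid CSR; here both n and the id are function parameters, my own stand-ins, so the sketch stays runnable anywhere.)

```c
#include <stddef.h>

// One hart's share of C = A * B: hart `mhartid` computes row i = mhartid,
// replacing the outer i loop entirely. Only harts 0..n-1 do useful work in
// this first version; using all 2,048 harts comes later in the talk.
void matmul_one_hart(size_t n, size_t mhartid,
                     const float *A, const float *B, float *C) {
    size_t i = mhartid;          // the outer i loop, replaced by the hart id
    if (i >= n) return;          // surplus harts sit idle for now
    for (size_t j = 0; j < n; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k < n; k++)
            acc += A[i * n + k] * B[k * n + j];
        C[i * n + j] = acc;
    }
}
```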

04:56.880 --> 05:01.840
So I can start out by taking one of my two loops, either the outer loop or the inner loop,

05:01.840 --> 05:06.640
and using mhartid instead of having a loop, which should give us a nice 512-fold

05:06.640 --> 05:12.880
speed-up, because we're now doing 512 things in parallel. Unfortunately,

05:12.880 --> 05:17.040
only one version of this code gives the right answer, the other one gives the wrong answer,

05:18.080 --> 05:21.680
which is troubling, because you look at the code and it should be correct,

05:21.680 --> 05:27.520
there aren't any typos there; it should be correct. So the reason for this

05:27.520 --> 05:32.400
lies in the memory hierarchy, which I'm trying to show you with this diagram here.

05:32.400 --> 05:36.880
Now, we'll come back to this later; for now I only want you to take one thing away from it,

05:36.880 --> 05:42.080
which is that there are a lot of caches on this chip, so every box with a dollar sign, that's a

05:42.080 --> 05:50.480
cache, and there are up to 2,465 caches, which is quite a few, and there's no coherency between any

05:50.640 --> 05:57.600
of them, which is a slight problem. If you're coming from a GPU kind of background, you'll be like,

05:57.600 --> 06:05.040
yeah, non-coherent caches, totally fine. If you're coming from a CPU background, you'll be like, wait, that's a problem.

06:06.400 --> 06:11.280
So to see why this causes a problem, we need to talk about cache lines, which on this chip

06:11.280 --> 06:17.920
are 64 bytes wide. So that means that when caches talk to each other, they do so in units of

06:17.920 --> 06:24.800
64 bytes. So to make the example concrete, if we look at harts number 3 and number 4

06:24.800 --> 06:31.520
as they write to their output matrix C: hart 3 writes that bit, hart 4 writes that bit,

06:31.520 --> 06:37.920
but they're both writing to their own local L1 data cache, and at some point later, the L1 will

06:37.920 --> 06:43.520
flush the entire 64-byte line rather than just the little bit the core wrote, and then one of

06:43.600 --> 06:47.840
those two line writes will overwrite the other, and one hart's write will get lost,

06:47.840 --> 06:54.240
and the answer's wrong, and you're in a world of pain, which is bad. But thankfully, we can choose,
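(The lost-write hazard comes down to whether two harts' writes land in the same 64-byte line. A plain-C sketch of that check; the 64 bytes is from the talk, while the helper itself is illustrative, not anything the chip provides.)

```c
#include <stdint.h>
#include <stddef.h>

enum { LINE = 64 };  /* cache-line size on this chip, in bytes */

// Index of the 64-byte line holding element (i, j) of a row-major n x n
// FP32 matrix. Two harts whose writes fall in the same line are hazardous:
// each hart's non-coherent L1 flushes whole lines, so whichever L1 flushes
// last silently overwrites the other hart's element.
size_t line_of(const float *C, size_t n, size_t i, size_t j) {
    return (size_t)((uintptr_t)&C[i * n + j] / LINE);
}
```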

06:54.240 --> 07:00.720
on an instruction-by-instruction basis, which cache to talk to. So if I use standard

07:00.720 --> 07:06.000
RISC-V memory instructions, then my cores will talk to the L1 data cache, which is a

07:06.000 --> 07:11.680
per-core, or per-hart, cache. Or I can use custom L instructions to talk to the L2 cache

07:11.680 --> 07:17.600
nearest the issuing core. Or I could use custom G instructions, which look more complex

07:17.600 --> 07:21.760
because there's now slightly more going on, but if you use them, they actually make things

07:21.760 --> 07:27.200
quite simple. They will give you the appearance of coherent memory, because they route to

07:27.200 --> 07:32.400
a cache based on the address rather than based on the core that's making the request, but this

07:32.400 --> 07:39.840
will come at a latency and throughput cost, so, you know, trade-offs. And as I'm using a standard

07:39.920 --> 07:45.920
RISC-V compiler, meant for CPUs, I get standard RISC-V memory instructions,

07:45.920 --> 07:54.480
so all my writes go to L1. So I could fix things by using G instructions, but for the

07:54.480 --> 08:02.480
particular access pattern you see, L is also fine, so I can make that change. If the compiler

08:02.480 --> 08:06.000
and language better understood the hardware, maybe I would annotate every pointer

08:06.000 --> 08:10.880
type with what kind of memory to use, but I've got to make use of the compiler that I have,

08:10.880 --> 08:16.960
so this is the code that I have changed. And now both versions are correct, and the first

08:16.960 --> 08:21.280
of the two runs at 3.4 gigaflops per second, which is still, you know, slow,

08:21.280 --> 08:28.800
but getting a little bit faster. And then the next easy win is to use all 2,048 harts,

08:28.800 --> 08:34.800
rather than only the first 512, which gets us the i loop fully parallelized, and then each

08:34.880 --> 08:38.960
hart does one quarter of the inner loop. All the changes are in that pink part there;

08:38.960 --> 08:44.480
the rest of the code is as before. And using four times as many harts gives me roughly a four-

08:44.480 --> 08:50.800
x speed-up. We're now at 13.1 gigaflops per second, which, as I said, is getting a little bit faster,

08:50.800 --> 08:57.920
bit by bit. And then the final easy win is to use all eight vector lanes per hart. And at this

08:57.920 --> 09:01.920
point I'm going to pretend that the C compiler is a little bit better than it actually is, and

09:01.920 --> 09:08.160
that I can go eight-wide SIMD just by changing fp32 to fp32x8 and fixing up

09:08.160 --> 09:13.600
a few other things. As I said, this isn't quite possible today, but if there are any LLVM hackers

09:13.600 --> 09:19.520
who want a little project, please make this slide possible. In the meantime, I will assemble

09:19.520 --> 09:27.440
this code by hand, because I like writing assembly code. And if I do that, then it'll run at 104

09:27.440 --> 09:36.080
gigaflops per second, which, you know, isn't so bad. Okay. So we've gone as parallel

09:36.080 --> 09:41.200
as the hardware will allow us to go, which means it's time for act two, which I'm

09:41.200 --> 09:45.360
calling being clever. We've gone as parallel as we can go, so now we have to actually think.

09:46.240 --> 09:50.720
So again, this is the memory hierarchy diagram we saw previously, same diagram as before.

09:51.280 --> 09:56.720
You'll note I've put a star on the L2 cache, the L3 cache, and the L2 scratchpad, and that's because all three

09:56.800 --> 10:02.000
of those come from the same pool of SRAM. And you can partition that pool of SRAM however you

10:02.000 --> 10:07.520
want to, but this is the kind of default split of that SRAM between the three. And of these three,

10:07.520 --> 10:13.760
the L2 scratchpad is really the interesting one here, because it behaves like normal

10:13.760 --> 10:21.440
addressable RAM, but has the latency of the L2 cache, which is nice: it behaves like memory,

10:21.520 --> 10:27.280
but it's fast, like cache. So every core will have a nearest L2 scratchpad, just as it has a nearest

10:27.280 --> 10:33.200
L2 cache, but any core can access any L2 scratchpad; it'll just require a longer path around the

10:33.200 --> 10:38.880
NoC to get there, so slightly slower, but it'll still work. And I'm not using that

10:38.880 --> 10:45.040
functionality, but it is kind of nice to have. So up until now, my matrices A and B were over in DRAM,

10:45.040 --> 10:51.280
which is quite a long way away from the cores. So I can just cheat and put A and B over in every L2

10:51.280 --> 10:56.960
scratchpad. Obviously, if you were doing things properly, you'd need to stream your data in as it's needed,

10:56.960 --> 11:02.960
but I can just cheat, put it all there to start with, and call it good. And if I do that,

11:02.960 --> 11:07.280
I get up to 212 gigaflops per second, which again is a little bit faster.

11:08.480 --> 11:13.280
So at this point, I'll look at the inner loop of the matmul, and I've gone back one step to

11:13.280 --> 11:18.240
the pre-SIMD code, just to keep things a little bit simpler. And this is the RISC-V assembly

11:18.320 --> 11:23.360
corresponding to that loop, with comments added in case assembly is a foreign language

11:23.360 --> 11:30.800
for most of you. But if you do have a keen eye for assembly, this line will jump out

11:30.800 --> 11:37.920
as coming out of nowhere and having no apparent purpose. So I want to dwell on this guy slightly.

11:39.040 --> 11:44.800
So the board I'm using is A0 silicon, and any hardware people in the room will be like, yeah,

11:44.880 --> 11:50.080
A0 silicon always has a few bugs in it that require a software workaround.

11:51.280 --> 11:56.080
And the software people in the room are like, what's A0 silicon? It's basically

11:56.080 --> 12:00.320
the first version that you send off to TSMC to be printed. It's like version zero of

12:00.320 --> 12:04.960
your software. Hopefully it doesn't have any bugs, but there are always a few bugs.

12:06.480 --> 12:12.080
And yeah, this is a workaround for one of those bugs. So now that you're over the shock of seeing

12:12.080 --> 12:17.280
some assembly code, this is back to the SIMD version. And again, it's the inner loop of the

12:17.280 --> 12:23.600
SIMD version that is of interest to me. And again, this is the assembly for that, and you'll notice it's

12:23.600 --> 12:28.640
very similar to the assembly you just saw, just with these weird .ps instructions

12:28.640 --> 12:35.200
rather than scalar ones; otherwise it's all the same. Now, it doesn't matter which version we look at:

12:35.280 --> 12:42.000
only 14% of the instructions are FMAs, which is rather too low. And we've got a lot of other

12:42.000 --> 12:46.400
instructions, which is kind of too high. And you know, percentages don't tell the whole story,

12:46.400 --> 12:53.040
but like this number is too low, that number is too high, and we should fix this. And one way

12:53.040 --> 12:58.640
of fixing this is to do more work per outer-loop iteration. And it so happens, you know, pulling

12:58.640 --> 13:03.440
a rabbit out of my hat here, that doing four times as much work per outer-loop iteration is the real

13:03.600 --> 13:08.160
sweet spot. So rather than doing a one-by-eight piece of C, we double up on both axes and do a

13:08.160 --> 13:14.160
two-by-sixteen piece of C. And the code now looks like this. You can see the pattern:

13:14.160 --> 13:18.720
we're doing four of everything: we've got four initialisations of the accumulators,

13:18.720 --> 13:25.840
then the eight loads, then four lines of matmuls, four lines of stores. And if we do this,

13:25.840 --> 13:32.480
then now 32% of the instructions are FMAs, and only 30% are others, which, you know,

13:32.560 --> 13:39.520
is going in the right direction. And in terms of performance, we're now above one teraflop per second.
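(The doubling-up trick can be sketched in plain scalar C with a 2x2 tile; the talk's real kernel uses a 2x16 tile of FP32x8 vectors, so this is just the shape of the idea, with names of my own choosing. Four FMAs per trip around the k loop are amortised over four loads plus the loop overhead, instead of one FMA per load.)

```c
#include <stddef.h>

// Register-tiled kernel: each k step updates a 2 x 2 tile of C held in
// four accumulators, so the load/FMA ratio improves versus the naive loop.
void matmul_tile2x2(size_t n, size_t i, size_t j,
                    const float *A, const float *B, float *C) {
    float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
    for (size_t k = 0; k < n; k++) {
        float a0 = A[i * n + k],     a1 = A[(i + 1) * n + k];
        float b0 = B[k * n + j],     b1 = B[k * n + j + 1];
        c00 += a0 * b0;  c01 += a0 * b1;   // four FMAs per iteration
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[i * n + j]           = c00;  C[i * n + j + 1]       = c01;
    C[(i + 1) * n + j]     = c10;  C[(i + 1) * n + j + 1] = c11;
}
```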

13:39.520 --> 13:44.480
So, you know, life is pretty good. And then we can pull the same trick again. Again, we can

13:44.480 --> 13:49.040
double up on both axes per outer-loop iteration. And I'm not going to show you the code for that,

13:49.040 --> 13:53.840
because the two-by-sixteen code only just fit on one slide; four-by-thirty-two is four times longer,

13:53.840 --> 13:58.080
so it certainly won't fit on one slide. But the percentages I can show you,

13:58.160 --> 14:03.520
and again things are going in the right direction. And performance gets me to just under three

14:03.520 --> 14:08.480
teraflops per second, which, you know, is starting to be, okay, still slower than I'd like,

14:08.480 --> 14:16.000
but, you know, kind of okay. So to get faster, I'd like 100% of my instructions to be

14:17.040 --> 14:21.120
FMAs. Of course, we'll still need some loads and some other things, which I'm going to

14:21.120 --> 14:26.160
ballpark at 20% and 5%. At which point you're going to say that this adds up to 125%,

14:26.240 --> 14:33.360
which is clearly not possible. But it just requires some creative thinking, not cheating,

14:33.360 --> 14:42.240
just creative thinking. So to hit 125%, all I need to do is on average execute 1.25

14:42.240 --> 14:48.080
instructions per core, per clock cycle, which means at least occasionally I need to execute two

14:48.080 --> 14:55.680
instructions per clock cycle, so that my average can hit 1.25. Which brings us to act three,

14:55.760 --> 14:59.840
which I'm calling magic hardware. You know, we've been as clever as we can,

14:59.840 --> 15:05.120
now we have to rely on the hardware being magical. So this is another diagram,

15:05.120 --> 15:10.320
this time of the ET-SoC-1, with the main region in the bottom left being a single

15:10.320 --> 15:16.080
minion CPU core. You'll note I've drawn the scalar unit and the vector unit as separate things.

15:16.080 --> 15:20.160
So you might hope to get two instructions per cycle by using them both on every cycle.

15:21.120 --> 15:25.360
But I can't do that through the RISC-V front end, because it can only issue one

15:25.360 --> 15:31.120
instruction per cycle. There's not enough front end to do two. So that's not going to work.

15:31.680 --> 15:35.840
But as I said, there is some magical hardware here which we could use to cheat. So,

15:37.040 --> 15:41.680
same diagram, but with the additional tensor hardware that is present on this chip,

15:41.680 --> 15:48.000
which I've drawn as three additional green boxes, because, just as with the RISC-V front end,

15:48.080 --> 15:52.480
each of these can issue one instruction per cycle. So now, as I have four of them,

15:52.480 --> 15:57.120
I can in theory hit four instructions per clock, get my average up, and call things good.
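(The budget arithmetic above can be written down directly. The 100/20/5 split is the talk's own ballpark; the helper function is just my bookkeeping for it.)

```c
#include <math.h>

// Instruction-budget arithmetic: FMAs on 100% of cycles, plus loads
// ballparked at 20% and everything else at 5%, needs an average of 1.25
// instructions per clock -- more than a 1-wide front end can issue, so
// some instructions must issue via a separate (tensor) unit.
double required_ipc(double fma, double loads, double other) {
    return fma + loads + other;  // fraction of a cycle each category needs
}
```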

15:59.440 --> 16:04.640
Now, we can quickly go through what each of these does. So you can use tensor L2 load

16:04.640 --> 16:09.440
to load data into the L2 scratchpad, which, if you were doing proper matmuls, you would be using.

16:09.440 --> 16:12.560
But I'm not drawing the complete owl, so I'm skipping it.

16:13.520 --> 16:20.160
Tensor L1 load, again, can only load data, but its destination is rather more fun.

16:20.880 --> 16:26.800
You'll note that we've lopped off a chunk of our L1 D-cache and turned it into this thing called the L1 scratchpad,

16:26.800 --> 16:30.480
which is a register file of sorts for reusing the results of tensor L1 loads.

16:32.320 --> 16:36.480
And then there's also this direct path from tensor L1 load to tensor compute.

16:36.480 --> 16:40.960
So any loads whose result you only use once can go straight to compute.

16:41.920 --> 16:46.800
These are called A-type loads and B-type loads, and I've put A and B on the diagram because it's going to be

16:46.800 --> 16:52.880
important in just a bit. And then tensor compute is really what I'm here to use.

16:53.440 --> 16:58.000
I've linked it to the vector unit to show that it can use the vector unit at the same time as the

16:58.000 --> 17:04.480
front end is using the scalar unit, so I can use both of those in parallel. And I've labelled the link

17:04.480 --> 17:10.800
across to the vector registers with a C. You'll also note that all my matmuls have been doing A times B

17:10.960 --> 17:16.240
equals C, so hopefully you can see where I'm going here, which is that this green box

17:16.240 --> 17:20.320
will take an A-type load, multiply it with a B-type load, and accumulate

17:20.320 --> 17:28.480
the C results into hart zero's vector registers. And then all we need to do is use this extra

17:28.480 --> 17:33.840
magical hardware, and we can make things go faster. And this is going to take three slides of code;

17:34.720 --> 17:40.480
this is slide one, doing the mhartid stuff. The only new thing here is that

17:40.560 --> 17:46.560
only hart zero of each core can use tensor compute. So I'm making hart one just finish immediately.

17:47.680 --> 17:53.440
Really it should be doing the tensor L2 loads, but I'm not drawing the complete owl. This is

17:53.440 --> 17:58.000
slide two of three, the main loop of the matmul. And you're going to say this reads like

17:58.000 --> 18:04.240
complete mud. Yes, it does read like complete mud. There isn't yet a particularly nice

18:04.240 --> 18:11.680
syntax for making use of these tensor units. So you can, but it's not pretty. So just

18:11.680 --> 18:17.040
take it on faith that this does work, even though it's ugly as anything. Really, you can grab

18:17.040 --> 18:22.880
the hardware manuals and cross-reference against this to figure out what it is doing. And then we have

18:22.880 --> 18:28.160
slide three of three for this slightly faster matmul, just doing a store of C off to memory.

18:28.240 --> 18:34.800
Again, clear as mud, I'm sure you understood everything here. But it is fast code. It will

18:34.800 --> 18:41.120
now run at 7.12 teraflops per second, which is starting to approach my goal here. I'm aiming

18:41.120 --> 18:50.560
for 10.6, and I've got 7.12. So the code I just showed you is like this left-hand

18:50.560 --> 18:56.320
column. There's slide one calculating i and j in the mhartid stuff. And then slide two

18:56.320 --> 19:01.760
was just a repeating loop of loads and FMAs. And then slide three was doing

19:01.760 --> 19:10.560
a store. Now, it is useful to rearrange things slightly and peel off the first load and peel off the

19:10.560 --> 19:16.800
last FMA, which gets us to this middle column. We have now grouped FMA-then-load rather

19:16.800 --> 19:22.480
than load-then-FMA. And you'll notice this doesn't yet look like a useful change,

19:22.480 --> 19:28.480
but if you do this, then in this repeating group, the FMA and the load don't

19:28.480 --> 19:34.400
depend on each other. So I can do them in parallel, which brings me to the right-

19:34.400 --> 19:39.760
hand column here, running my FMAs and my loads in parallel. And then look at the clear-as-mud

19:39.760 --> 19:45.760
code for the right-hand column. And again, you're going to be like: yep, slide of code, clear as mud.
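(The peeling trick can be sketched in plain C on a dot product, with ordinary loads and multiplies as hypothetical stand-ins for the tensor instructions. After peeling the first load and the last FMA, the body pairs the FMA for step k with the load for step k+1, which are independent and can issue in parallel on the real hardware.)

```c
#include <stddef.h>

// Software-pipelined dot product, n >= 1: first load and last FMA peeled
// out of the loop so that each loop body does one FMA and one independent
// look-ahead load.
float dot_pipelined(size_t n, const float *a, const float *b) {
    float acc = 0.0f;
    float a0 = a[0], b0 = b[0];             // peeled first load
    for (size_t k = 0; k + 1 < n; k++) {
        float an = a[k + 1], bn = b[k + 1]; // load for step k+1 ...
        acc += a0 * b0;                     // ... alongside the FMA for step k
        a0 = an; b0 = bn;
    }
    acc += a0 * b0;                         // peeled last FMA
    return acc;
}
```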

19:46.400 --> 19:52.320
So again, this is going to take three slides of code. This is kind of what we saw before,

19:52.320 --> 19:56.800
with the mhartid stuff and then the first tensor load. Again, I don't expect you to exactly

19:56.800 --> 20:00.320
understand this; I just want you to appreciate that it can be done.

20:02.400 --> 20:06.720
Again, slide two of three: it can be done. It's, you know, rearranged some stuff. We've got

20:06.720 --> 20:12.880
the FMA first and the load last. Again, clear as mud. I'm

20:12.880 --> 20:18.880
sure you get all of that. Slide three is now the final FMA fused with that store.

20:20.160 --> 20:26.880
Wonderful code, moving swiftly on. But what I will say is that it now runs at just over 10

20:26.880 --> 20:35.680
teraflops per second. So I think that's close enough to my goal of 10.6. And as I'm pretty much

20:35.680 --> 20:41.280
out of time, with like two more seconds to go, someone else can draw the rest of this owl. Because I think

20:41.280 --> 21:02.000
I'm done. Any questions? You explained everything really well, so I have just

21:02.000 --> 21:14.160
one question. Could you say a few words about what you actually ran these

21:14.160 --> 21:18.720
experiments on? Did you manage to get, you know, some old board from Esperanto on eBay,

21:18.720 --> 21:26.960
or is there something new from Enico? Can I answer? We gave him one. Actually, if you're nice

21:26.960 --> 21:31.840
to the community and ask us, we have a bunch of them right now. So basically it's

21:31.840 --> 21:47.520
not that hard. Yeah. So yeah, you can sign an NDA that allows you to get access as well if you,

21:47.520 --> 21:54.800
you know, can't actually get a card. But they do have some cards. Yeah, we have about 10 machines in the

21:54.800 --> 22:01.520
lab that are available and we have some cards like about maybe 80 of them left that we can

22:01.520 --> 22:06.960
send to people, and we actually have some chips left. So if somebody is interested in designing a board

22:06.960 --> 22:13.680
with that, that's also an option. One additional question: how long did it take for you to

22:13.680 --> 22:21.120
optimize the whole thing, the whole chain of optimizations? I'm not sure. I'm going to ballpark

22:21.120 --> 22:30.800
a week, maybe. Yeah, I don't really measure my time. Yeah. And how long did it take to optimize

22:30.800 --> 22:46.080
[unclear]? Two weeks. Well, ish. Yeah, one last question.

22:46.080 --> 23:05.840
Hi, so all these changes that you did were at the assembly level. So the work

23:05.840 --> 23:12.240
now, are you putting these changes into the compiler and upstreaming them,

23:12.240 --> 23:19.520
or how is it? Is it only at the assembly level that somebody would get the benefit, or are you

23:19.520 --> 23:28.960
actually trying to get everything into your compiler and upstream into LLVM or GCC? How would it be?

23:28.960 --> 23:48.400
So John Lucas is probably more appropriate to answer that one. But the LLVM port is upstream. The GCC port

23:48.400 --> 23:56.720
is slightly further along for historical reasons, but probably less likely to be upstreamed. So

23:56.720 --> 24:05.920
I think going forward, upstream LLVM is where things are headed. Or, you can jump in? Yeah, I mean,

24:05.920 --> 24:13.360
you're unfortunately correct, but we're not giving up on GCC actually. But yeah, so LLVM is

24:13.360 --> 24:19.520
upstream, but you can definitely use GCC today; if you only use inline assembly, it's going to be

24:19.520 --> 24:23.520
fine. The moment you start being a bit too clever with C and using too many floating point

24:23.520 --> 24:29.280
registers, you're going to have a bad time. I'm going to call it good for time now.

