WEBVTT

00:00.000 --> 00:11.600
Hello everyone, welcome to FOSDEM, the graphics devroom. We're kicking it off today with

00:11.600 --> 00:17.440
a talk about geometry shaders in PanVK with libpoly.

00:17.440 --> 00:21.240
So, about me, for those of you who don't know, because I haven't been to FOSDEM in like a decade.

00:21.240 --> 00:28.080
My name is Faith Ekstrand, I work for Collabora, and I've been around freedesktop for quite a few

00:28.080 --> 00:30.960
years now.

00:30.960 --> 00:37.760
I was at Intel for a while, where I did the Intel Vulkan driver for Linux, and I did a bunch

00:37.760 --> 00:43.920
of other stuff: compiler stuff, various Intel bits, a bunch of Vulkan common code that

00:43.920 --> 00:49.480
all of Mesa is now using. And then I've been at Collabora since 2022, where I work basically

00:49.480 --> 00:53.160
anywhere in the Linux graphics stack where I'm needed.

00:53.240 --> 00:58.880
I wrote the NVK driver stack for NVIDIA hardware, and I spent the last two or three

00:58.880 --> 01:03.800
months helping out the PanVK team, trying to get stuff into a little bit better shape there,

01:03.800 --> 01:06.680
and we're going to talk a little bit about that today.

01:06.680 --> 01:10.600
So what are geometry shaders? That's the first question, because we need to talk about what

01:10.600 --> 01:15.480
these are and why they're hard and why this is interesting stuff.

01:15.480 --> 01:21.680
So geometry shaders sit right before rasterization: you have your vertex shading, you

01:21.720 --> 01:27.320
might have some tessellation, if you have a high enough API version for that, and then

01:27.320 --> 01:31.800
you potentially have a geometry shader that runs on the output of all that.

01:31.800 --> 01:37.240
And unlike some of the other shader stages, geometry shaders act per-primitive, so

01:37.240 --> 01:40.840
they consume primitives and they produce primitives, and they might consume and produce different

01:40.840 --> 01:45.360
primitive types: they can consume triangles and produce points, consume points and produce

01:45.360 --> 01:51.400
triangles, and they can also dynamically change the number of primitives, so you can consume

01:51.400 --> 01:54.400
one point and produce seven triangles.

01:54.400 --> 01:57.640
You can have a geometry shader that sometimes produces seven triangles and sometimes produces zero

01:57.640 --> 02:01.920
triangles.

02:01.920 --> 02:07.000
So here's one example of a geometry shader that takes a point and produces two triangles

02:07.000 --> 02:09.280
to make a square.

02:09.280 --> 02:10.280
Why might you want to do this?

02:10.280 --> 02:15.000
Well, maybe you don't trust point sprites, and you want to do some sort of environment

02:15.000 --> 02:19.240
effect, and it's easier to shove a bunch of points down your pipeline, and you can do everything

02:19.240 --> 02:21.360
with a geometry shader.

02:21.360 --> 02:27.640
And the way this works is you have your normal outputs, like, in this case, gl_Position,

02:27.640 --> 02:33.080
and you write to it, and then you do EmitVertex(), and that causes that full vertex to be

02:33.080 --> 02:35.760
kicked out. Then you set up your outputs again and you EmitVertex(), and you set

02:35.760 --> 02:40.840
up your outputs again and you EmitVertex().
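
A hedged CPU-side sketch of the point-to-quad geometry shader just described: one input point expands into a four-vertex triangle strip (two triangles). The `emit_vertex()` helper and the capture globals are illustrative stand-ins for `EmitVertex()`, not a real API.

```c
#include <stdio.h>

typedef struct { float x, y; } vec2;

static int emitted;          /* counts emit_vertex() calls */
static vec2 out_verts[4];    /* captured output vertices   */

/* Stand-in for EmitVertex(): record the current output vertex. */
static void emit_vertex(vec2 v) { out_verts[emitted++] = v; }

/* Expand a point at `p` with half-size `r` into a quad (triangle strip). */
static void point_to_quad(vec2 p, float r)
{
    emitted = 0;
    emit_vertex((vec2){p.x - r, p.y - r});
    emit_vertex((vec2){p.x + r, p.y - r});
    emit_vertex((vec2){p.x - r, p.y + r});
    emit_vertex((vec2){p.x + r, p.y + r});
    /* EndPrimitive() would close the strip here. */
}
```

Four emitted vertices as a strip give exactly the two triangles of the square.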

02:40.840 --> 02:49.240
Here is another example that turns a solid triangle into lines, so we walk over all the

02:49.240 --> 02:53.920
vertices, and we do the first one, second one, and third one, and then back to the

02:53.920 --> 02:56.240
first one again, as a line loop.

02:56.240 --> 03:01.920
And voila, we can do lines.

03:01.920 --> 03:06.040
Here's one that duplicates triangles, so you put in one triangle, and it gives you out two.

03:06.040 --> 03:08.720
These are just some examples of things that you could potentially do with geometry shaders.

03:08.720 --> 03:12.920
We'll see a couple of these again later, but also the application can do whatever

03:12.920 --> 03:13.920
it wants.

03:14.320 --> 03:20.920
I'm not here to tell them what they can and cannot do, but these are some examples.

03:20.920 --> 03:26.920
OK, so let's talk about why geometry shaders are hard.

03:26.920 --> 03:33.440
First of all, like I said, the number of primitives you can emit is dynamic, so EmitVertex()

03:33.440 --> 03:35.920
and EndPrimitive() might be inside control flow.

03:35.920 --> 03:40.920
So you might have a loop that emits a bunch of primitives, and the loop has a loop

03:40.920 --> 03:44.920
counter, and the loop count might not be known at compile time.

03:44.920 --> 03:49.920
So we don't necessarily know when we're looking at the shader code or the object that

03:49.920 --> 03:56.360
we've compiled how many primitives we're going to get out for each primitive we put in.

03:56.360 --> 04:01.200
We try to figure this out at compile time if we can, and if we can unroll the loops and

04:01.200 --> 04:05.760
do all the usual tricks, then oftentimes we can figure this out, but sometimes we can't,

04:05.760 --> 04:08.400
and we have to be able to deal with that.

04:08.400 --> 04:14.600
We also have to deal with partial primitives: if you are emitting vertices, say you're

04:14.600 --> 04:21.040
emitting triangles, and you emit five vertices, that only gets you one primitive, so you

04:21.040 --> 04:24.800
can end up with sort of redundant stuff at the end of your output, and we have to deal

04:24.800 --> 04:31.800
with that.
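
The partial-primitive math above can be sketched as a small helper: given how many vertices a geometry shader emitted, how many complete primitives is that? Any leftover vertices are the "redundant stuff" that has to be dropped. The topology enum here is illustrative, not any driver's real one.

```c
/* Complete primitives for a given vertex count; leftovers are discarded. */
enum topology { TOPO_POINTS, TOPO_LINES, TOPO_TRIANGLES,
                TOPO_LINE_STRIP, TOPO_TRIANGLE_STRIP };

static unsigned complete_prims(enum topology t, unsigned verts)
{
    switch (t) {
    case TOPO_POINTS:         return verts;
    case TOPO_LINES:          return verts / 2;
    case TOPO_TRIANGLES:      return verts / 3;           /* 5 verts -> 1 */
    case TOPO_LINE_STRIP:     return verts >= 2 ? verts - 1 : 0;
    case TOPO_TRIANGLE_STRIP: return verts >= 3 ? verts - 2 : 0;
    }
    return 0;
}
```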

04:31.800 --> 04:36.600
We also have to be able to output primitives in a predictable order, and this is one of the

04:36.600 --> 04:41.640
hardest problems with geometry shaders, why everybody hates them, and why it just

04:41.640 --> 04:45.240
generally sucks.

04:45.240 --> 04:47.160
There's a few different reasons for this.

04:47.160 --> 04:53.800
One is that geometry shaders can feed into transform feedback, which, if you are a bit

04:53.800 --> 04:57.240
of a graphics old-timer, you're probably cringing right now at the thought of that API,

04:57.240 --> 04:59.720
but there it is.

04:59.720 --> 05:01.680
We need to be able to support that.

05:01.680 --> 05:05.720
Transform feedback is something that goes back quite a long ways; basically

05:05.720 --> 05:11.640
it goes back quite a long ways in OpenGL, and we added it to Vulkan, unfortunately. And what

05:11.640 --> 05:17.640
it does is it allows you to ask the hardware to please take all of the data that's coming

05:17.640 --> 05:21.640
out of the geometry pipeline and write it into a buffer for me.

05:21.640 --> 05:24.240
We need to be able to do that, and the order that we write that data out needs

05:24.240 --> 05:27.880
to be predictable, because applications are then going to read that data and do something

05:27.880 --> 05:30.800
with it, and we need to get that right.

05:30.800 --> 05:34.160
There are also some things that are order-dependent in the rasterization pipeline, so

05:34.160 --> 05:38.840
color blending, for instance, is not necessarily something that you can do in arbitrary

05:38.840 --> 05:41.680
order and get the same results.

05:41.680 --> 05:47.360
Some color blending equations, like just adding or doing a min or max, are order-independent,

05:47.360 --> 05:50.240
but some of them are not.

05:50.240 --> 05:53.560
Depth testing is also not necessarily order-independent.

05:53.560 --> 06:01.600
It's mostly order-independent, depending on your values, but if you're using a fixed-point

06:01.600 --> 06:08.920
depth format, like a 16 or 24-bit one, then whether you use GREATER or GREATER_EQUAL gives

06:08.920 --> 06:13.920
you a first-primitive-wins or a last-primitive-wins rule, and if you don't know what

06:13.920 --> 06:18.760
your primitive order is, you can get unpredictable results with depth testing as well.

06:18.760 --> 06:25.600
So it's really important that we output stuff in the correct order.

06:25.600 --> 06:29.160
And then finally, we also don't necessarily know the number of input primitives.

06:29.160 --> 06:34.320
So this isn't the worst of the problems, but it is something that we have to be able

06:34.320 --> 06:36.480
to deal with in the implementation.

06:36.480 --> 06:39.760
And this is mostly due to draw indirect, although there are a couple of other cases where

06:39.760 --> 06:42.120
that pops up as well.

06:42.120 --> 06:46.560
Okay, so let's talk about geometry shaders on Mali, which is what we're trying

06:46.560 --> 06:47.720
to do here.

06:47.720 --> 06:52.840
So, on Valhall, Mali added support to enable geometry shading: they added layered

06:52.840 --> 06:58.520
rendering in hardware, which is really nice; not having that would have been a pain.

06:58.520 --> 07:02.720
And they added the built-ins gl_Layer and gl_PrimitiveID, writable from a vertex shader.

07:02.720 --> 07:07.920
That is the sum total of Mali's geometry shader support.

07:07.920 --> 07:13.200
Everything else is just left as an exercise to the reader, and unfortunately that's us.

07:13.200 --> 07:17.440
So, how are we going to do this?

07:17.440 --> 07:24.960
And the answer is compute shaders, lots of compute shaders, and also lots of draw

07:24.960 --> 07:29.600
indirect, which we'll get into in a bit.

07:29.600 --> 07:34.000
So this is where libpoly comes in.

07:34.000 --> 07:36.160
What is libpoly?

07:36.160 --> 07:42.960
So libpoly is a library that was originally written as part of the Asahi driver stack

07:42.960 --> 07:51.200
for the Apple M1 by Alyssa Rosenzweig, and she did it because she was trying to build

07:51.240 --> 08:00.720
a basically desktop-capable Vulkan driver for what is effectively mobile silicon.

08:00.720 --> 08:04.240
And so obviously she needed to be able to support all the desktop features, because people want

08:04.240 --> 08:08.080
to play Steam games, and so we need to be able to support geometry shaders, tessellation,

08:08.080 --> 08:12.680
the works.

08:12.680 --> 08:19.880
And then a bunch of work was done to pull libpoly out so that it can be used by other

08:19.880 --> 08:27.600
drivers, and we've since done some cleaning up to get it into a shape where it's more

08:27.600 --> 08:30.760
broadly usable.

08:30.760 --> 08:35.640
It is written in a combination of OpenCL and NIR.

08:35.640 --> 08:39.880
The manual NIR stuff is mostly for the lowering passes: we need to be able to take

08:40.840 --> 08:45.800
a geometry shader and turn it into the sequence of compute shaders that needs to run.

08:45.800 --> 08:49.560
There's some amount of manual NIR manipulation that needs to happen.

08:49.560 --> 09:01.160
But then the actual details of how we do vertex fetching, how we deal with all the stuff

09:01.160 --> 09:06.680
like index buffers, the tessellator, which we're not going to be talking about in

09:06.760 --> 09:11.640
detail today, all of that stuff is just written in OpenCL, and the OpenCL kernels

09:11.640 --> 09:21.640
then get compiled down into NIR.

09:21.640 --> 09:29.000
Currently, the M1 driver supports tessellation, but we haven't actually pulled that out into

09:29.000 --> 09:33.560
libpoly yet; a lot of that still lives in the M1 driver, but it is on the

09:33.640 --> 09:39.720
roadmap to do that. And then with this, we can implement geometry shaders, transform feedback,

09:39.720 --> 09:46.360
and all of the other stuff that is generally a pain on tiling GPUs, because we're able to do it in

09:46.360 --> 09:49.720
compute shaders.

09:49.720 --> 09:54.840
Okay, so how does libpoly solve these problems? I had a list of problems, and I'm going to

09:54.840 --> 10:00.200
kind of go through them and talk about how they get solved. First, let's look at the very simple case,

10:00.280 --> 10:04.200
because getting this case right is actually really important for

10:04.200 --> 10:07.400
getting any kind of performance out of this in the real world.

10:09.400 --> 10:14.360
So, let's say we have a very simple case. We have a regular, not indirect,

10:14.360 --> 10:21.560
draw. So, no funny business there. There's no primitive restart, because that screws up everything,

10:21.560 --> 10:28.440
because that's a wonderful API. The geometry shader produces a compile-time-known number of

10:28.440 --> 10:32.120
primitives. So, we can look at the geometry shader and we can see there's one primitive or two

10:32.120 --> 10:38.600
primitives. And there's no transform feedback. Okay, so if all of this happens,

10:41.480 --> 10:48.120
then we can compute the number of output primitives up front, which is really nice because this

10:48.120 --> 10:53.000
allows us to allocate memory up front, because one of the other problems that we have here is that

10:53.640 --> 10:59.240
the output of my vertex shader, which is now a compute shader, has to go somewhere to be

10:59.240 --> 11:03.880
fed into the geometry shader. And it's nice to be able to allocate that with a fixed size.

11:06.680 --> 11:14.520
And this also means we can use direct dispatches and draws, because we know

11:14.520 --> 11:21.880
all of the numbers of things up front. But the really nice thing here is that we can actually

11:21.880 --> 11:30.840
run the entire geometry shader as a vertex shader. So, the vertex shader you get out looks like this.

11:32.360 --> 11:38.680
It just switches on the vertex ID to determine which vertex of the primitive it's in,

11:38.680 --> 11:47.160
based off the vertex ID, and runs that particular chunk of the geometry shader. And this allows

11:47.160 --> 11:54.680
us to run the entire geometry shader, as if it's part of the vertex shader. And so this makes

11:54.680 --> 12:02.200
a bunch of cases with geometry shaders basically free. In particular, when you're dealing with

12:02.200 --> 12:11.720
transform feedback, this means that we can do transform feedback with a geometry shader, but the

12:11.800 --> 12:18.120
geometry shader does nothing, and it doesn't actually add any overhead. Because even with

12:18.920 --> 12:22.520
out having to deal with geometry shaders, transform feedback is still a giant pain, because

12:23.720 --> 12:28.200
mobile hardware does not like doing memory writes from vertex shaders, which is what you have to

12:28.200 --> 12:34.680
do for it. So, what we do here is we can just insert a dummy geometry shader; it's the

12:34.680 --> 12:40.280
simplest geometry shader you can possibly write. And we do transform feedback as part of the

12:40.360 --> 12:45.240
now compute vertex shader that feeds into this dummy geometry shader, and the dummy geometry

12:45.240 --> 12:52.520
shader compiles down to basically just copying data, because even the switch can compile away

12:52.520 --> 12:59.880
if every case is just copying data. So, this takes the single most common geometry shader case,

13:00.680 --> 13:09.720
and basically eliminates it for us. Unfortunately, that's the simple case.
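<br/>
The "run the geometry shader as a vertex shader" trick above can be sketched as index math: when the geometry shader emits a fixed number of vertices per input primitive, the driver launches that many hardware vertices, and each one decodes from its flat vertex ID which input primitive it belongs to and which emitted vertex (the switch target) it should compute. Names and the per-invocation count here are illustrative assumptions.

```c
/* Fixed number of emitted vertices per GS invocation (compile-time known
 * in the simple case described in the talk; 4 is just an example). */
#define VERTS_PER_INVOCATION 4u

struct gs_slot { unsigned input_prim, out_vertex; };

/* Decode a flat hardware vertex ID into (input primitive, emitted vertex). */
static struct gs_slot decode_vertex_id(unsigned vertex_id)
{
    return (struct gs_slot){
        .input_prim = vertex_id / VERTS_PER_INVOCATION,
        .out_vertex = vertex_id % VERTS_PER_INVOCATION, /* switch() target */
    };
}
```

With a fixed count, placement is pure division and modulo, which is why this case can be made essentially free.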

13:10.600 --> 13:18.040
The complicated case is a lot more complicated. So, let's talk about the different problems

13:18.040 --> 13:23.080
I said that make geometry shaders hard. First is that we don't necessarily know

13:23.080 --> 13:29.320
the vertex and primitive counts. One way this can happen is draw indirect. So, with

13:29.320 --> 13:37.080
draw indirect, you have all of the vertex buffers and index buffers that get set up, and then the actual

13:37.160 --> 13:44.760
draw parameters live in GPU-visible memory that contains your base vertex, your vertex count,

13:44.760 --> 13:51.880
base instance, and instance count, and the hardware reads that out of memory and then draws that amount.

13:53.240 --> 13:58.920
But we as a driver do not have visibility into how much drawing is actually being done.

13:59.080 --> 14:06.760
There are also cases where a direct draw can basically end up as an indirect draw. So,

14:07.560 --> 14:12.680
one example of this is primitive restart, because the way primitive restart works is you have an

14:12.680 --> 14:17.720
indexed draw, so you have an index buffer, and the index buffer has a bunch of indices in it that

14:17.720 --> 14:24.360
tell you which vertex to use, which is very common in applications.

14:24.680 --> 14:31.320
And then there's a magic index value that says: actually, don't draw anything, reset the

14:31.320 --> 14:38.840
primitive, forget whatever half primitive you've started. This is particularly useful if you're using

14:38.840 --> 14:45.080
triangle strips, triangle fans, line loops, et cetera, because you can just do, like,

14:46.440 --> 14:50.840
five vertices for your triangle strip, and then primitive restart, and then do more vertices,

14:50.840 --> 14:55.400
and then primitive restart, and do a bunch of triangle strips in a single draw; it's very useful for

14:55.400 --> 15:01.320
that. But from our perspective, it means that we can't actually tell how many vertices

15:01.320 --> 15:04.920
are going in, or how many primitives are going to be drawn. We can tell how many indices are going

15:04.920 --> 15:07.800
into the buffer, but without knowing what those indices are, we don't know how many times it

15:04.920 --> 15:07.800
restarts, and so we don't know how many primitives that actually turns into. Another case we can get into

15:14.680 --> 15:20.680
is tessellation. We're not diving into this in super great detail today, but tessellation also

15:20.680 --> 15:26.840
effectively produces an indirect draw going into the geometry shader. It's the same solution for all of these,

15:26.840 --> 15:36.520
but yeah. So how do we solve this? We have this thing called the poly heap, and the way this works

15:36.520 --> 15:41.480
is that inside the driver you detect the indirect cases where you're going to need the heap,

15:42.440 --> 15:47.240
and you allocate a big heap the first time you ever need one, and then the driver sub

15:47.240 --> 15:54.680
allocates out of it. Currently there is no solution for growing this heap dynamically. One could be

15:54.680 --> 16:04.680
developed. There are tricks on Mali, for instance, to flush out the tile buffer and do this

16:05.880 --> 16:09.960
incremental rendering thing, where we realize that we're using too much memory, and so we flush

16:10.040 --> 16:15.000
the tile buffer and then start again, but we have not hooked this into the heap. It could

16:15.000 --> 16:25.960
theoretically be done, but it's a bunch of work that hasn't been done. For the M1,

16:26.680 --> 16:30.760
Alyssa found a heap size that lets you play every D3D title she's tried, and called it good.
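
Suballocation out of a heap like this boils down to a single atomic bump, as the talk describes a bit later. Here is a hedged CPU sketch under assumed names (`struct heap`, `heap_alloc` are illustrative, not libpoly's real API); a real GPU version would run the atomic in a kernel and might trigger incremental rendering instead of failing on exhaustion.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One big heap allocated up front; GPU-side code carves pieces out of it. */
struct heap {
    uint8_t         *base;  /* start of the big allocation */
    uint64_t         size;  /* total size in bytes         */
    _Atomic uint64_t top;   /* current bump offset         */
};

/* Bump-allocate `bytes` with `align` (a power of two).  Rounding every
 * request up to the alignment keeps `top`, and thus every returned offset,
 * aligned.  Returns UINT64_MAX if the heap is exhausted. */
static uint64_t heap_alloc(struct heap *h, uint64_t bytes, uint64_t align)
{
    uint64_t need = (bytes + align - 1) & ~(align - 1);
    uint64_t off = atomic_fetch_add(&h->top, need);
    if (off + need > h->size)
        return UINT64_MAX;
    return off;
}
```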

16:32.600 --> 16:36.680
And honestly, as long as the heap size isn't too big, it's probably a fine solution,

16:36.680 --> 16:43.080
because most applications aren't using geometry shaders that heavily, and so you're really only

16:43.080 --> 16:49.240
penalizing the ones that are using this feature, so it's kind of fine. If you're on a phone or whatever,

16:49.240 --> 16:52.840
it's going to be the one game you have on your screen that needs this buffer, it's not going to be

16:52.840 --> 16:57.400
your system compositor or any of that stuff. So it's kind of fine to just allocate a bunch of memory

16:57.400 --> 17:03.480
the first time you see a weird geometry shader case. And then the other thing we do

17:03.560 --> 17:12.600
to deal with unknown counts is that we do the actual setup as a compute shader. So we have a

17:12.600 --> 17:19.240
tiny little 1x1x1 compute dispatch that runs, looks at the indirect draw data, does the calculations

17:19.240 --> 17:24.520
for how much memory we need, and allocates it out of the heap. And then we feed that into

17:24.520 --> 17:31.080
the indirect draw. So how do we draw from the heap?
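
The 1x1x1 setup dispatch just described can be sketched roughly like this. All the struct and function names are illustrative assumptions, and the heap bump is shown as a plain counter; the real thing reads the application's indirect parameters on the GPU and writes out the parameters for the next dispatch.

```c
#include <stdint.h>

/* Application-style indirect draw parameters (VkDrawIndirectCommand shape). */
struct draw_params { uint32_t vertex_count, instance_count,
                     first_vertex, first_instance; };
struct dispatch_params { uint32_t x, y, z; };

static uint64_t heap_top;  /* stand-in for the atomic heap bump offset */

/* Read the indirect draw, size the vertex-output buffer, allocate it from
 * the heap, and produce the dispatch parameters for the compute "VS". */
static struct dispatch_params
setup_vs_dispatch(const struct draw_params *draw,
                  uint32_t bytes_per_vertex, uint32_t wg_size,
                  uint64_t *out_vs_buffer_offset)
{
    uint32_t verts = draw->vertex_count * draw->instance_count;
    *out_vs_buffer_offset = heap_top;
    heap_top += (uint64_t)verts * bytes_per_vertex;    /* heap allocation */
    /* One thread per vertex, rounded up to whole workgroups. */
    return (struct dispatch_params){ (verts + wg_size - 1) / wg_size, 1, 1 };
}
```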

17:33.480 --> 17:37.880
This is one of the problems that we have. Okay, so we can allocate stuff on the heap;

17:38.760 --> 17:45.080
that's easy enough, that's just an atomic; we can do atomics. But how do we actually feed this into the

17:45.080 --> 17:48.120
3D hardware? Because we're doing a bunch of compute shaders, but at the end of the day,

17:48.120 --> 17:55.080
this is going into very, very fixed-function hardware. And when Alyssa was originally doing this,

17:55.080 --> 18:01.080
unlike on Mali, where we have this command stream that can do memory loads

18:01.080 --> 18:08.360
and state setup and stuff, the M1 doesn't have that. The only thing the M1 has is draw indirect,

18:08.360 --> 18:14.200
and it's very, very baked into the hardware. It's just a draw indirect call: you pass it

18:14.200 --> 18:20.040
a pointer to the four or five dwords that describe your indirect draw, and you get an indirect draw.

18:20.120 --> 18:31.080
So we need to be able to feed into that. For anything that is just a compute shader,

18:31.880 --> 18:36.600
so if we're feeding from a vertex shader into a geometry shader, or we're feeding from a

18:36.600 --> 18:41.960
tessellation shader into a geometry shader, anytime the destination is a compute shader,

18:41.960 --> 18:48.200
we can just pass pointers around. We have 64-bit pointers that we can use in the hardware;

18:48.280 --> 18:57.320
we can pass those around, and it all works. Vertex data is typically fetched using global loads,

18:57.320 --> 19:01.000
so we don't really need to worry about that side; that's also kind of left up to the driver,

19:01.000 --> 19:06.680
and that doesn't really come from the heap. But at the end, we need to be able to feed the

19:06.680 --> 19:11.800
rasterizer, and the rasterizer is going to be fairly fixed-function. So the way we do this is draw in

19:11.800 --> 19:20.040
direct. So the vertex data, like I said, the vertex shader is responsible for loading it;

19:20.040 --> 19:24.680
we as a driver have to figure out how that works. That's fine, that's the easy part.

19:24.680 --> 19:28.440
The output of the vertex shader then goes in memory someplace; that goes through a

19:28.440 --> 19:36.200
pointer, that's fine, we can handle that. The pain point is the index buffer. So there's a

19:36.200 --> 19:40.280
bunch of these cases that end up generating index buffers that feed into the next thing. In fact,

19:40.280 --> 19:44.520
most of libpoly works in terms of index buffers. The hardware needs to be able to access the

19:44.520 --> 19:49.960
index buffer, and the index buffer lives somewhere inside this heap. And the way that we deal with that

19:49.960 --> 19:59.880
is that whenever we're doing an indirect draw, we have the actual, like, API-level index buffer binding

19:59.880 --> 20:04.280
point to the base of the heap, and we use the first-index field to offset to wherever we actually live

20:04.280 --> 20:10.600
in the heap. And so by doing this, we don't need to be able to pass arbitrary GPU

20:10.600 --> 20:15.000
pointers that come from compute shaders into that fixed-function hardware. We can pass around

20:15.000 --> 20:20.440
the base of the heap, which we know on the CPU, and then we can use the first index from the

20:20.440 --> 20:27.080
indirect draw in order to actually get to that offset in the buffer. So our index buffers can live

20:27.080 --> 20:32.280
anywhere we want them to, and we just do that to get everything into an indirect draw, and off we go.
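
The index-buffer trick above is just address arithmetic, sketched here under assumed names: the API-level binding always points at the base of the heap, and the indirect draw's first-index field selects where the generated index buffer actually lives, so the fixed-function frontend effectively fetches `heap_base + (first_index + i) * index_size` without ever seeing a GPU pointer from a compute shader.

```c
#include <stdint.h>

/* Indexed indirect draw parameters (VkDrawIndexedIndirectCommand shape). */
struct indexed_draw { uint32_t index_count, instance_count, first_index;
                      int32_t vertex_offset; uint32_t first_instance; };

/* Turn a heap offset produced by the setup compute shader into firstIndex.
 * The offset must be a multiple of the index size. */
static uint32_t heap_offset_to_first_index(uint64_t offset, uint32_t index_size)
{
    return (uint32_t)(offset / index_size);
}

/* The address the hardware effectively fetches for element i of the draw,
 * given that the bound index buffer starts at the base of the heap. */
static uint64_t fetched_index_addr(uint64_t heap_base,
                                   const struct indexed_draw *d,
                                   uint32_t index_size, uint32_t i)
{
    return heap_base + ((uint64_t)d->first_index + i) * index_size;
}
```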

20:33.000 --> 20:39.320
And we also assume that our code is correct, and that we already did the bounds checking

20:39.320 --> 20:45.560
for index buffers, and so when we bind this index buffer, we literally just bind the entire heap.

20:47.160 --> 20:51.320
And we just assume, again, that we're not generating bad indices or bad sizes,

20:52.760 --> 20:56.440
which, you know, hopefully we can trust our own code, but you know, who knows.

20:57.320 --> 21:06.120
And then there are also some cases where we'll use dispatch indirect. So if we're feeding from a

21:06.120 --> 21:09.640
compute shader into another compute shader, and we don't know how many primitives we have,

21:09.640 --> 21:12.920
then we might not know how many compute invocations we need for the second thing,

21:12.920 --> 21:17.720
and so that's a dispatch indirect. And it's basically the same story. We have the 1x1x1

21:17.720 --> 21:24.120
compute dispatch that sets up allocations, memory, and parameters. It can also set up the indirect dispatch.

21:24.920 --> 21:30.840
OK, so that problem is sorted. We can deal with, hopefully, amounts of geometry

21:30.840 --> 21:37.960
that we don't know on the CPU. How do we deal with primitive ordering? So this is also a giant pain.

21:40.360 --> 21:45.880
And again, if the number of outputs is fixed, everything is easy. We can literally just multiply.

21:45.880 --> 21:53.880
We have the number of primitives per invocation; we can just place them in the output

21:53.880 --> 22:01.160
buffer based off of that. Everything's good. If not, we have to do geometry shading in multiple passes.

22:03.960 --> 22:08.680
The first pass runs an extremely stripped-down version of the geometry shader,

22:08.680 --> 22:14.920
where we've deleted everything except counting the number of primitives. All it does is it runs

22:14.920 --> 22:18.840
enough of the geometry shader to figure out how many primitives that particular geometry shader

22:19.480 --> 22:28.120
invocation is going to emit. Hopefully, if everything constant-folds nicely, this just turns into writing out a uniform

22:28.120 --> 22:33.080
or something, because maybe you have some loop inside your geometry shader that emits primitives,

22:33.720 --> 22:39.160
and really all we need is the loop count. So hopefully, it's a very tiny shader that isn't

22:39.160 --> 22:45.080
actually doing anything interesting. But we run this compute shader, it figures out the number of

22:45.160 --> 22:50.840
primitives, and it writes that out to memory. Then we run an algorithm called a prefix sum.

22:52.600 --> 23:00.920
And a prefix sum just takes a sequence of integers, and it produces, for each one of those integers,

23:00.920 --> 23:05.880
the sum of all of the integers before it. It's a fairly standard algorithm in computer science.

23:05.880 --> 23:12.040
It's well understood how to parallelize this. And we have a prefix sum algorithm that's written as an

23:12.120 --> 23:19.320
OpenCL kernel that uses subgroup ops and shared memory as needed to do the prefix sum as efficiently as

23:19.320 --> 23:25.720
possible. The reason why we need to do this is because we need to know where each

23:25.720 --> 23:31.560
output primitive goes in our output buffer. And so first we write out the number of

23:31.560 --> 23:36.600
primitives, and then we prefix-sum, and that tells us, for each primitive index, where that primitive

23:36.600 --> 23:43.080
needs to go in the output buffer. Once that's done, we can run the final geometry shader.
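
The prefix-sum step can be sketched serially on the CPU; the version the talk describes is a parallel OpenCL kernel using subgroup ops, but the math is the same. For each invocation's primitive count, the sum of all counts before it is exactly the slot where that invocation starts writing in the output buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Exclusive prefix sum, in place: counts[i] becomes the sum of all counts
 * before i (i.e. invocation i's starting slot in the output buffer).
 * Returns the grand total, which sizes the output buffer. */
static uint32_t prefix_sum_exclusive(uint32_t *counts, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = counts[i];
        counts[i] = total;   /* where invocation i's primitives begin */
        total += c;
    }
    return total;
}
```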

23:43.080 --> 23:48.840
And this geometry shader is then going to do a bunch of geometry shader stuff, do a bunch of calculations,

23:48.840 --> 23:55.080
and at the end write out each primitive to the output buffer. And then the vertex shader in this case is

23:55.080 --> 24:01.880
just a pass-through. So this is not the case where we can run the geometry shader as if it were the

24:01.880 --> 24:07.640
vertex shader; we have to have a vertex shader that then copies the data output by the

24:07.640 --> 24:16.280
geometry shader into the rasterizer. On these GPUs, like on Mali, we could

24:16.280 --> 24:24.680
theoretically, on some generations, feed the rasterizer directly ourselves, but on some of them we

24:24.680 --> 24:29.320
really do need a vertex shader anyway. A vertex shader that just copies data is not particularly

24:29.400 --> 24:37.000
expensive, it's fine, but we do need something there to take from an output buffer and put it into

24:37.000 --> 24:52.440
the rasterizer. And there are also some simple cases where we might need to do all of this,

24:52.440 --> 25:01.160
even if we don't have a geometry shader. One is transform feedback, like I mentioned before,

25:01.720 --> 25:05.800
or vertex shaders that have side effects, like writing global memory, doing atomics, things like that.

25:07.000 --> 25:12.680
Those are very unfriendly to tiling GPUs, because tilers like to re-run vertex shaders,

25:12.680 --> 25:18.680
so vertex shaders can end up getting run an arbitrary number of times. And so you really

25:19.320 --> 25:25.640
want to avoid doing side-effecty things in them, because that gets wacky and weird. And

25:27.080 --> 25:31.880
by doing this we can just do the transform feedback as part of the geometry shader, which runs

25:31.880 --> 25:39.640
before the vertex shader. And we get transform feedback fairly easily. Doing transform feedback

25:39.640 --> 25:46.040
from a compute shader is relatively easy compared to some of the other options. There are also

25:46.040 --> 25:51.240
some pipeline statistics that are easier to compute this way. So there's pipeline statistics for

25:51.240 --> 25:58.840
things like primitives generated. And you need that statistic even

25:58.840 --> 26:05.000
if you don't have a geometry shader. It might just be the number of

26:06.200 --> 26:10.360
vertices divided by something, but it might get more complicated than that, especially if you

26:10.360 --> 26:16.600
have primitive restart. And so we just turn on the geometry shader path in that case.
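
Why primitive restart turns "primitives generated" from a division into a scan can be shown with a small sketch, with the triangle-strip case as an assumed example: the index buffer is split into strips at the restart value, and each strip of length L contributes max(0, L - 2) triangles, so you have to walk the indices.

```c
#include <stddef.h>
#include <stdint.h>

/* Count triangle-strip primitives in an index buffer with restart enabled.
 * Each run of non-restart indices of length L yields max(0, L - 2) tris. */
static uint32_t tri_strip_prims(const uint16_t *indices, size_t count,
                                uint16_t restart_index)
{
    uint32_t prims = 0, run = 0;
    for (size_t i = 0; i < count; i++) {
        if (indices[i] == restart_index)
            run = 0;            /* forget the half-built strip */
        else if (++run >= 3)
            prims++;            /* every vertex past the 2nd adds a tri */
    }
    return prims;
}
```

Without restart this would simply be `index_count - 2`; with restart, the count depends on index values only visible on the GPU, which is exactly why the counting runs as a compute pass.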

26:18.840 --> 26:24.200
Okay. Now let's talk about what this looks like inside the driver. So how does this integrate?

26:24.200 --> 26:28.840
So you have all this code. All of this is living in a shared folder. It's all implemented. You can

26:28.840 --> 26:34.040
basically just turn it on, but "just turn it on" is not as easy as it sounds. So what does this look

26:34.040 --> 26:42.280
like inside the driver? A lot of the work to hook this up wasn't actually work on libpoly

26:42.280 --> 26:47.960
or work on shader compilers or any of that. It was mostly just breaking assumptions inside of PanVK.

26:49.960 --> 26:55.240
So when PanVK was brought up, it was brought up on hardware that is basically designed to

26:55.240 --> 26:59.960
run GLES and that's it. So: vertex shaders, fragment shaders, compute shaders. And it made a lot of

26:59.960 --> 27:08.200
assumptions about only having those three shader stages. One thing that was

27:08.200 --> 27:13.880
assumed is that all API stages map one-to-one to hardware stages. So there were various places

27:13.880 --> 27:20.120
in the state setup code where instead of passing around the actual API stage that we're looking

27:20.120 --> 27:26.600
at or passing around pointers to the data structures containing bindings, we just passed the

27:26.600 --> 27:30.840
shader around and looked at the stage from the shader and used that to go fetch stuff.

27:33.400 --> 27:38.440
That isn't going to work. Another thing that we assumed is that compute shaders are compute shaders.

27:39.240 --> 27:44.360
But now we have compute shaders that are not compute shaders. So we had to break some assumptions there.

27:45.160 --> 27:49.320
We assumed that vertex shaders are always vertex shaders. Again, not a valid assumption.

27:50.520 --> 27:55.640
And we assumed that a shader's stage always comes from the Vulkan API, which again isn't a valid assumption

27:55.720 --> 28:06.040
anymore. So what did we have to do? For the first one, like I said, a lot of this came down to just

28:06.040 --> 28:11.560
various places where we were kind of lazy inside the driver and we said, hey, I don't need to pass

28:11.560 --> 28:16.360
this extra data around, I'm passing around the shader anyway. There's a stage enum inside the shader.

28:16.360 --> 28:20.120
I can just fetch that and look at it and figure out what my bindings are or whatever.

28:21.080 --> 28:27.960
That doesn't work anymore, because sometimes your shader stage is not what it looks like it should be.

28:29.880 --> 28:33.160
For instance, the geometry shader does not exist after lowering.

28:34.280 --> 28:39.640
If you look at the Vulcan shader object, you can see that it's a VK geometry shader and if you poke

28:39.640 --> 28:44.520
it that bit. But if you look at the thing that the back end compiler produced, it produced computers.

28:45.160 --> 28:52.680
Not a geometry shader. Also, the geometry shader lowering produces vertex shaders,

28:52.680 --> 28:55.640
which are kind of vertex shaders, but also kind of not vertex shaders.

28:57.240 --> 29:03.160
That was confusing inside the driver. So a lot of this came down to just going through each point

29:03.160 --> 29:08.200
that looks at bindings, looks at anything that depends on the shader stage, and intentionally

29:08.600 --> 29:14.520
passing all of that information the whole way through. We're setting up

29:15.480 --> 29:20.360
descriptor sets for geometry shaders now? Then let's pass the pointer to the geometry shader state

29:20.920 --> 29:24.040
and pass the shader as separate things, so that we set up the right thing.

29:24.040 --> 29:28.360
And we're not switching on shader stages and looking stuff up and getting it wrong.

29:31.800 --> 29:36.280
We also had some fairly deep assumptions built into the compiler that compute shaders are compute shaders.

29:37.000 --> 29:42.600
The biggest issue here was the way we do push constants and root descriptors.

29:44.840 --> 29:49.560
Where we had a separate push constant layout for 3D stuff and compute stuff,

29:50.200 --> 29:56.520
which used to be fine and a good idea, because you need different system values for compute than

29:56.520 --> 30:05.240
you need for 3D. In 3D you don't need the compute workgroup size. You don't need the compute

30:05.240 --> 30:12.920
base indices, and similarly in compute you don't need, like, your base vertex. So we had

30:12.920 --> 30:20.760
these separate things, one for 3D and one for compute. And if the shader stage was

30:20.760 --> 30:25.240
compute, then you got the compute one. But suddenly we need to be able to dispatch those

30:25.240 --> 30:30.280
as if they were 3D shaders. So there were some assumptions in the compiler that we had to break.

30:31.240 --> 30:36.200
We also had some common system values that we shared between 3D and compute.

30:36.200 --> 30:40.520
They weren't at the same offset, and so it would just look at the shader stage and be like,

30:40.520 --> 30:44.600
ah, use this offset now instead of that one. But then we would bind that as a 3D shader

30:45.400 --> 30:48.040
and give it system values for a 3D shader, and it would blow up.

30:49.640 --> 30:54.280
So it was kind of an audit process of going through the shader compiler and making sure that we

30:54.280 --> 30:56.280
didn't screw that up for ourselves.

30:56.520 --> 31:06.520
So we had this concept of a vertex shader inside the driver, because you have a vertex shader,

31:06.520 --> 31:14.840
you bind one, everyone does. And we had to split this into two pieces. So one piece is the

31:14.840 --> 31:21.240
API vertex shader, the software view of a vertex shader. This is the thing that the app binds as

31:21.240 --> 31:28.520
a vertex shader. It's the thing that uses any descriptors that are bound for vertex stuff.

31:31.480 --> 31:36.440
It's the thing that has to deal with attribute divisors, which is a weird Panfrost detail where we have to

31:36.440 --> 31:43.640
do some descriptor patching for the inputs in order to get attribute divisors correct, and there's

31:43.640 --> 31:50.040
some command-stream fiddling we have to do for that. But that, again, only applies to the actual

31:50.040 --> 31:57.160
like API vertex shader. And so we have this stuff, this is what we call the API vertex shader

31:57.160 --> 32:01.080
but this might actually not be a vertex shader. It might be a hardware vertex shader or it might be a compute shader.

32:02.280 --> 32:08.280
And so we had to sort of separate this off into: the concept of an API vertex shader is the

32:08.280 --> 32:11.800
first shader that runs. Whether it's a compute shader or a vertex shader doesn't matter,

32:11.800 --> 32:15.240
it's the first one in the pipeline. The second concept we had to have

32:16.120 --> 32:21.960
is the hardware vertex shader. So this is the last shader in the geometry part of your pipeline.

32:21.960 --> 32:27.160
And this is the one that actually runs as part of tiling and binning, that whole process.

32:28.280 --> 32:34.280
On Mali, the whole IDVS thing. And this might be the Vulkan vertex shader,

32:35.080 --> 32:39.320
or it might not be. It might come from geometry, it might come from tessellation,

32:39.400 --> 32:45.800
but whatever it is, it's the last shader that runs. And so we had to build this divide

32:45.800 --> 32:52.440
inside the driver that didn't really exist before, because we just assumed that a vertex shader is

32:52.440 --> 33:00.360
a vertex shader, and the code was all entangled, and we had to very cleanly cut this.

33:00.760 --> 33:10.760
The next issue, which was not obvious to me when I started this, but it became obvious

33:10.760 --> 33:18.600
as we went along, is that your geometry state does not necessarily come from Vulkan. There are

33:18.600 --> 33:23.880
not very many dynamic states that impact geometry. When you look at the dynamic state for the

33:23.880 --> 33:31.960
3D rendering pipeline, apart from vertex input bindings, most of it is for rasterization.

33:31.960 --> 33:39.800
Most of it is depth states, blending states, primitive modes; all of this stuff is rasterization

33:39.800 --> 33:46.760
stuff. The only things that actually affect the geometry pipeline, apart from, again, the vertex

33:46.840 --> 33:56.040
bindings, are the primitive topology, whether we do primitive restart, the index buffer, and the size

33:56.040 --> 34:06.200
of an index in the index buffer. Unfortunately, in the CSF backend, which is for the newer

34:06.200 --> 34:13.160
Mali GPUs, we were doing all of the usual graphics-driver tricks of tracking the dirty bits,

34:13.160 --> 34:19.080
seeing when something changes, pulling it from the Vulkan API, shoving it into the hardware, all of that

34:19.080 --> 34:27.640
usual song and dance. But the problem is that the state might actually come from libagx,

34:29.800 --> 34:35.400
which means that it's not coming from the Vulkan runtime. Sometimes we need to override all of that

34:35.400 --> 34:39.720
and fetch it from libagx, and sometimes it comes from the Vulkan runtime, depending on whether

34:39.720 --> 34:47.160
or not you have a geometry shader or transform feedback or whatever else. So what we end up doing is

34:47.160 --> 34:53.320
plumbing it through the draw itself. So we have this struct that describes an entire draw, and when you do

34:53.320 --> 34:58.440
like vkCmdDraw, it constructs one of these, and it passes it down to the main draw code, and

34:59.480 --> 35:04.520
it has fields for indirects, fields for all of these different things that control a draw,

35:04.520 --> 35:12.840
and we can sort of pass it all through a central point. But what we end up having to do is make it so that

35:12.840 --> 35:19.960
these things are all part of this sort of draw state that gets passed atomically, so that it can be

35:19.960 --> 35:29.480
modified by libagx. Because when you have libagx, your rendering pipeline actually looks

35:29.480 --> 35:38.360
more like this. So it's not like this linear set of stages. I mean, it kind of is, but it ends

35:38.360 --> 35:44.280
up looking more like: you have some vertex data, and you have to provide a way to fetch the

35:44.280 --> 35:48.040
vertex data for use in this compute shader. That's a problem that's left to the driver.

35:49.960 --> 35:56.040
On some hardware, that means doing some sort of a shader preamble and just doing global memory loads.

35:57.000 --> 36:05.160
On Mali, we have descriptors for vertex data. There's a descriptor type; we just

36:05.160 --> 36:10.360
populate those, it's part of the descriptor set, and it feeds into the first shader stage. It's actually

36:10.360 --> 36:16.760
not a hard problem for us to solve. But you have that, and it feeds into this big blob of compute shaders.

36:18.840 --> 36:22.040
Some of those compute shaders are tessellation shaders, and they fetch from tessellation

36:22.040 --> 36:24.840
descriptors; some of them are geometry shaders, and they fetch from geometry descriptors.

36:25.480 --> 36:31.720
And then what that blob of stuff produces is draws. And the draw on this side and the draw on

36:31.720 --> 36:38.200
that side might not be the same draw. So you might have something that starts off as a regular,

36:38.200 --> 36:41.640
plain and simple draw that ends up being an indexed draw. You might have something that starts as

36:41.640 --> 36:47.080
a plain and simple draw that ends up being an indirect draw. You might have a big multi-draw

36:47.080 --> 36:54.040
that ends up being a single indirect draw. And so you need to be able to mutate this draw

36:54.040 --> 37:03.000
state as a separate concept from the Vulkan API, because draws are the output of this pile of

37:03.000 --> 37:11.160
code. And then those draws feed into the fixed-function pipeline and rasterization and

37:11.160 --> 37:15.720
fragment shading, just like everything else. But you end up with this thing that's kind of

37:15.720 --> 37:22.520
bolted in between your vertex data and your rasterization pipeline. You have this extra blob of stuff

37:22.520 --> 37:26.040
that you have to insert whenever you have one of these cases where you need libagx.

37:28.680 --> 37:34.280
Okay, that's a lot of work. A bunch of refactoring, a bunch of churn in the driver. What do you get out of it?

37:35.240 --> 37:44.600
So, what can libagx do? libagx can do geometry shaders. It can do transform feedback.

37:47.320 --> 37:51.640
Once you have everything running in compute shaders and you've sorted out all the problems of getting

37:51.640 --> 37:56.760
your primitives in the right order, doing transform feedback is actually pretty easy. You have to

37:56.760 --> 38:03.480
pass a little bit of extra data for your buffer bindings. But otherwise it's just global data writes

38:03.560 --> 38:09.960
from the geometry shader. It can also do pipeline statistics for you. So again,

38:09.960 --> 38:15.800
things like primitive counts, stuff like that that you might not have, and that might not even make sense

38:15.800 --> 38:22.520
in a tiler, because primitive counts in a tiler are weird because everything gets binned. We can

38:22.520 --> 38:29.160
implement that in a consistent way that makes it look like any other desktop driver.

38:29.720 --> 38:34.920
You get the same statistics out. And one of these days we'll get tessellation moved into libagx, so

38:34.920 --> 38:46.040
you'll get that too. So, let's talk about where we are. libagx is now separate from

38:46.040 --> 38:51.720
AGX. It's already been pulled out. I had to do a bunch of work on it to get it working on Mali, because

38:51.720 --> 39:00.600
there were some Apple GPU assumptions in there. Like, it assumed 1,024-invocation workgroups. It assumed

39:01.480 --> 39:08.440
32-wide subgroups, and some other stuff. In particular, the prefix sum algorithm

39:08.440 --> 39:14.840
and some of the algorithms around primitive restart did subgroup ops, and they had a bunch of

39:14.840 --> 39:20.200
assumptions about workgroup sizes and subgroup sizes that don't actually hold on Mali. So we

39:20.200 --> 39:25.080
had to fix that and generalize all of that. That work has been done. It's upstream. Those refactors

39:25.080 --> 39:31.160
are done, and we managed to do it without breaking AGX. The PanVK refactors that

39:31.160 --> 39:36.600
I talked about are about 50% upstream; about 50% of it is still living in a branch, because I got this

39:36.600 --> 39:41.480
all working and then had to go work on something else that was pressing, and I haven't gotten back

39:41.480 --> 39:48.600
to it, but I will over the course of the next month or two. Yeah, let's look at what we're

39:48.600 --> 39:52.680
doing. Mostly, the branch just needs a little bit of cleaning up. Transform feedback and

39:52.680 --> 39:57.560
queries: we know they work because they work for AGX. I know how to hook them up. It's just

39:57.560 --> 40:02.040
a matter of, like, actually adding all of the Vulkan entry points and plumbing it through.

40:03.400 --> 40:07.800
There shouldn't be any hard engineering work there, but there is some work that needs to be done.

40:09.640 --> 40:13.720
But it is mostly working. We have geometry shaders happening on PanVK. It is passing

40:13.880 --> 40:21.480
thousands of tests. There's a couple of bugs that have been fixed since the last time I

40:21.480 --> 40:27.000
rebased my branch, so it should be down to like 20 test failures, I think, once I rebase.

40:29.400 --> 40:32.520
And yeah, you too could have geometry shaders, if you want them.

40:34.680 --> 40:41.000
Mostly I expect this to be interesting to Mali people and people who are working on

40:41.080 --> 40:46.760
Apple hardware. But if you have some other tiler where geometry shaders are a giant pain,

40:46.760 --> 40:53.960
this might be easier. I don't know. And that's it. Any questions?

41:06.280 --> 41:07.560
Go ahead. I don't think we have a mic.

41:11.000 --> 41:18.360
[Audience question, largely inaudible: how do you make the small version of the

41:18.360 --> 41:27.480
geometry shader that only computes the primitive count?]

41:27.480 --> 41:33.320
Okay. So the question is how do we actually produce this geometry shader

41:33.320 --> 41:40.520
that just writes out the number of primitives? It's actually pretty easy. So we already have

41:40.520 --> 41:46.440
some passes in Mesa for various hardware. It's pretty common for hardware to want to know the

41:46.440 --> 41:53.160
number of primitives. So the way it looks in GLSL is you have this EndPrimitive() thing.

41:54.360 --> 42:02.920
But there's a bunch of hardware where the way it works is you emit, like, you pass it an index;

42:02.920 --> 42:07.560
like you say: this is the 0th primitive, this is the first primitive, this is the second primitive,

42:07.560 --> 42:12.760
and then at the end you say: and I had four primitives. And so we already have a lowering pass,

42:12.760 --> 42:20.040
which turns these EndPrimitive() calls into... basically, it adds a counter for the number of primitives,

42:20.600 --> 42:26.760
and it increments that counter, and then at the very end of the shader, it has an intrinsic

42:26.840 --> 42:33.800
to write the counter out. Well, what we do is we just take that intrinsic to write the counter out

42:33.800 --> 42:38.440
and we turn it into something that writes to memory, and then we delete the rest of the shader.

42:41.160 --> 42:45.800
So if you just delete all of the other stuff that has side effects, and you only leave the

42:45.800 --> 42:50.120
one thing that writes memory, then the compiler's dead-code pass will just delete the rest of the shader

42:50.120 --> 42:57.880
and, well, you have this simple version. And then loop unrolling and some of those passes

42:57.880 --> 43:05.560
are hopefully going to be smart enough to be able to turn that into just a loop counter or something.

43:05.560 --> 43:12.120
But if it can't, running a loop that just adds an integer isn't actually that expensive,

43:12.120 --> 43:19.000
so it's kind of fine. But yeah, it's a problem that looks hard. Like, oh my gosh,

43:19.000 --> 43:22.600
we're going to take this geometry shader and reduce it. But it really just comes down to

43:22.600 --> 43:28.840
running this pass, doing something useful with the final intrinsic at the end, and then

43:29.800 --> 43:32.680
deleting some things and letting the compiler's dead-code pass go to town.

43:37.080 --> 43:40.760
So what happens if that aforementioned buffer does run out? What does that look like from the

43:41.560 --> 43:47.800
caller's side? What happens if the buffer runs out? Runs out. Um, things probably blow up.

43:51.560 --> 44:02.040
It's, yeah, we, I think we could probably fairly easily detect that case and just stop

44:02.200 --> 44:09.560
emitting geometry at that point. But, um, that's kind of the best that we can do without

44:09.560 --> 44:16.760
doing a whole giant, like, super dynamic, stop the entire pipeline, flush everything out,

44:16.760 --> 44:23.880
and restart. Like I said, that kind of already exists for Mali, because tiler GPUs tend to run

44:23.880 --> 44:29.560
into these problems, and so there's an annoying amount of engineering effort that has gone into

44:29.560 --> 44:36.520
being able to flush the tiler and restart on some of this hardware; we have something that

44:36.520 --> 44:42.680
has to do that for Mali. Um, but then tying that into the pile of compute shaders that we have,

44:43.320 --> 44:50.120
and making it so that those are also restartable: not as easy as one would like it to be.

44:50.920 --> 44:57.800
It is probably possible. If we run into this in the wild and we need a buffer that's way bigger

44:57.800 --> 45:04.360
than anything we're using today, then maybe that's worth it. But right now, allocating, you know,

45:04.360 --> 45:09.960
64 megabytes or something the moment we see a geometry shader is good enough for the DX12 titles,

45:09.960 --> 45:12.040
so it's good enough for anything you're going to run on Mali.

45:18.040 --> 45:18.360
Good.

45:18.360 --> 45:28.200
I think you said that it's possible on Mali to feed the rasterizer directly, like you can just

45:28.200 --> 45:33.400
write something to on-chip, to on-chip memory, and go from there. So, okay, the question is about feeding

45:33.400 --> 45:39.560
the rasterizer directly. So, Mali has gone through a series of changes regarding how the rasterizer

45:39.560 --> 45:47.080
gets fed. Um, on the super old ones, the vertex shader was literally just a compute shader

45:47.160 --> 45:52.600
that ran over every vertex and produced the outputs in a special buffer format, and that's what

45:52.600 --> 46:00.200
the tiler consumed. Um, later Malis have shifted more and more towards the hardware having

46:00.200 --> 46:07.320
more direct management over those buffers. By the time you get to the Malis where we're running

46:08.120 --> 46:14.120
geometry shaders, this is actually interesting. Um, writing into that buffer directly isn't really

46:14.200 --> 46:24.280
something that's practical. Um, it is all just memory. We could probably do something, but at that

46:24.280 --> 46:30.760
point, it's probably not really worth it, because the caches and stuff are already set up to be

46:30.760 --> 46:36.280
able to handle taking the output of the vertex shader and shoving it into rasterization. And so

46:37.000 --> 46:42.520
just having that dummy shader that does the copy is probably okay, and not going to murder

46:42.840 --> 46:48.680
performance. Um, and again, it is getting to the point where it is so tightly managed by

46:48.680 --> 46:55.960
the hardware that trying to write it ourselves is maybe not real practical. Did you have a question?

46:55.960 --> 47:01.960
Yeah, um, on older Mali, would it be possible to use libagx

47:01.960 --> 47:07.560
to implement, uh, stipple counters for hardware that's missing them? What counters?

47:07.640 --> 47:17.080
Stipple counters. Um, you mean possibly line stippling? Um, possibly. It depends on,

47:19.640 --> 47:26.760
it depends on what exactly we're trying to do and how we'd say that. But yes, I think that if you

47:26.760 --> 47:37.400
wanted to do stable stippling across line lists, or, well, yeah, line strips

47:37.400 --> 47:43.240
or line loops, um, that would be the place to do it, because stipple counting is basically a

47:43.240 --> 47:49.240
prefix sum problem. You would have a compute shader that runs ahead of it and just adds up all of the

47:49.240 --> 47:55.480
line lengths, and then do the same prefix sum to figure out your offsets, and then you could figure

47:55.480 --> 47:58.600
out your stipple patterns from that. So I think that would probably be the place to do that.

47:58.600 --> 48:03.720
If we really decide that we care about reproducible stippling on Mali.

48:05.960 --> 48:09.640
The reason why I'm asking is that a lot of more well-established stuff needs this feature and

48:09.640 --> 48:17.880
about failing tests. Yeah, um, again, I'm not sure there are a lot of Mali, a lot of, like,

48:17.880 --> 48:21.880
workloads that really care about that that are going to run on those GPUs, because that tends

48:21.880 --> 48:27.720
to be stuff like CAD applications. But, yes, we could probably do line stippling that way. Why not?

48:28.840 --> 48:32.120
Once you have everything running in compute shaders, the world's your oyster.

48:33.800 --> 48:39.800
So in the case where you have this variable number of vertices per geometry shader invocation,

48:40.360 --> 48:44.280
you mentioned that you do four passes. My question is,

48:45.000 --> 48:51.480
could you do it with just one pass, but allocate the maximum number of vertices instead? Of

48:51.880 --> 48:55.560
course. So the question was, in this case where we do the four passes,

48:55.560 --> 48:59.960
can we just do it with a single pass, if we allocate the maximum amount? And the answer is no,

48:59.960 --> 49:03.400
because we still have to be able to produce the primitives in the correct order.

49:04.200 --> 49:08.120
So, part of the problem is figuring out the counts so that we can allocate memory, but the other

49:08.120 --> 49:12.840
part of the problem is making sure the primitives are output in the correct order.

49:12.840 --> 49:19.000
We can't just have an atomic that we increment from the compute shaders to pick where the

49:19.080 --> 49:24.920
primitive goes; we have to do all of the primitives from the first geometry shader,

49:24.920 --> 49:28.520
all the primitives from the second geometry shader, all the primitives from the third geometry shader,

49:29.240 --> 49:33.960
because we need predictable rasterization order, or else depth testing and color blending break.

49:40.680 --> 49:42.600
You said what? You'll have to speak up.

49:42.600 --> 49:54.920
So the question is, why would the order change, if we did, I don't know what that was,

49:54.920 --> 50:00.680
the atomic kind of thing, or, like, so the problem is we're trying to process all of them

50:00.680 --> 50:05.000
in parallel, right? We have compute shaders flying, we have thousands of primitives that are

50:05.000 --> 50:12.040
happening, and they all have to come out in the right order. We don't know where they end up going

50:12.040 --> 50:18.120
in the output buffer. For transform feedback, well, if we're not doing transform feedback, we could theoretically

50:18.120 --> 50:27.800
make the output buffer sparse and just skip some primitives at times. That would probably work,

50:27.800 --> 50:33.080
as long as we make the geometry shader write out some sort of an empty primitive that has,

50:33.080 --> 50:41.160
like, a vertex count of zero. But that still doesn't work for the transform feedback case.

50:41.160 --> 50:45.880
So when we have transform feedback, we have to pack it tight. But yes, we could theoretically

50:45.880 --> 50:52.680
emit degenerate primitives instead of eliminating them and doing the packing step.

50:54.200 --> 50:59.160
Is that going to be more efficient? I don't know, because that means you're going to be

50:59.720 --> 51:05.000
emitting a lot more than we'd write otherwise, because you have to fill your entire maximum space with

51:05.800 --> 51:13.080
these zero-size primitives. Maybe it's a win, maybe it's not; not sure. The prefix sum itself is

51:13.080 --> 51:19.320
really fast, so I suspect it would actually probably be a loss in the common case.

51:20.280 --> 51:28.600
And the common case, if we're honest, is either a predictable number or, like, just deleting

51:28.600 --> 51:36.360
some of the primitives. So, maybe we'd be fine, but yeah, it's not 100% clear where the win

51:36.360 --> 51:39.080
would be there. I think that's probably the last question.

51:40.760 --> 51:45.400
So those slides you had, like, you mentioned that some of the internal shaders

51:45.480 --> 51:50.440
are written in OpenCL. Why OpenCL? Why is it that we use OpenCL in graphics?

51:50.440 --> 51:54.200
That's a very good question. So the question was, I mentioned some of the internal

51:54.200 --> 51:59.000
shaders are written in OpenCL; why OpenCL? And the answer is that it's just a

51:59.000 --> 52:04.920
whole lot easier to write OpenCL code than GLSL or some of the other options that we have.

52:06.760 --> 52:11.560
This is especially true when you have to start poking at memory, because in OpenCL, we have actual

52:11.560 --> 52:20.200
pointers. And we don't have to try and shoehorn everything into SSBO bindings that get turned into

52:20.200 --> 52:25.240
descriptor sets that then get bound. We can just do pointer arithmetic and it's all good.

52:26.440 --> 52:35.240
It also means that we can use, we can have a header file that gets included in C code on the

52:35.240 --> 52:42.040
CPU and gets included in OpenCL C code on the GPU, and both of those can talk to it. So, like,

52:42.040 --> 52:48.040
all of the sort of sideband state setup stuff for this, the stuff that sort of controls it,

52:48.920 --> 52:54.360
is just structs, and you just fill them out with data and stick them in GPU-visible memory,

52:54.360 --> 52:58.120
and then the OpenCL code just goes and reads from those structs and it all works. We don't have to,

52:58.120 --> 53:02.760
like, duplicate anything. And instead of having to try to make sure that the GLSL layout is the same as the

53:02.840 --> 53:08.120
C layout, we can just use pointers and it works. It makes things that much easier.

53:10.120 --> 53:13.320
We could do it without that, but it just makes our lives a lot easier.

53:17.320 --> 53:24.760
You're talking about buffer device address? We could do that too, and, like, that's basically

53:24.760 --> 53:29.240
what pointers get turned into. But a lot of it also comes down to the fact that it's really nice to be

53:29.240 --> 53:38.120
able to just write C code. Especially since, for things like the tessellator, there are unit tests;

53:38.120 --> 53:44.520
the unit tests run on the CPU version, and we can actually test and develop and debug stuff

53:45.640 --> 53:52.440
and then put it on the GPU. And so it's not that it's strictly necessary; like, we could do all

53:52.440 --> 53:57.480
of this in GLSL, we could do all of this in, you know, whatever other language, we could do it all

53:57.560 --> 54:03.320
directly in nir_builder. It's just that, from a quality-of-life perspective, and from trying to make

54:03.320 --> 54:08.040
it so that we can develop this without pulling our hair out, writing it this way is just a whole

54:08.040 --> 54:13.560
lot easier, and that's what it comes down to. So I think that's all the time we have, so thank you very much.

54:27.560 --> 54:29.560
Yeah.

54:38.600 --> 54:44.600
Yeah, we are running test box there as well and each is barge is in this morning and it's all the

