WEBVTT

00:00.000 --> 00:26.900
All right, take a seat. You are going to listen to a talk about OpenMP in GCC. Don't

00:26.900 --> 00:34.100
forget that the slides are already up on the FOSDEM page for this session, so you

00:34.100 --> 00:39.420
can find the slides there. You can still take pictures if you want, but you can also find

00:39.420 --> 00:40.420
them there.

00:40.420 --> 00:49.900
I'll start with a few words about myself. I'm a contributor to the OpenMP specification and also

00:49.900 --> 00:58.020
to the relevant parts of GCC itself: offloading, OpenMP and Fortran, which is how I started.

00:58.020 --> 01:05.380
My background is in computational physics, and that's how I came to contribute to gfortran.

01:05.380 --> 01:11.460
Nowadays I work for BayLibre; that's where I get paid. One part there

01:11.460 --> 01:17.580
does work related to toolchains, like compilers, debuggers and so on, and

01:17.580 --> 01:23.460
the other part does a bit more on the Linux kernel and related things, embedded

01:23.460 --> 01:30.260
and so on. That's the background. Let's switch to GCC, to the OpenMP and offloading

01:30.260 --> 01:41.300
part. GCC meanwhile has general support for OpenMP and OpenACC. There's

01:41.300 --> 01:47.980
fairly complete support of 5.2, with of course some gaps. One of the larger gaps, but not the

01:47.980 --> 01:54.820
only one, is the missing OMPT and OMPD support, so tools and debugging. There are

01:54.820 --> 02:00.020
also these links; especially in the first one you get a bird's-eye view of which features

02:00.020 --> 02:07.740
are implemented and which are not, so yes or no, unlike the second link there, which is

02:07.740 --> 02:11.860
for a specific version. The first link really says: in this version, this

02:11.860 --> 02:20.220
feature is supported. Besides the basic support there is offloading support for Nvidia and AMD GPUs. There is some information

02:20.220 --> 02:26.020
about building them and where to get them: I mean, distributions ship them, but as optional

02:26.020 --> 02:33.220
packages, and for building them yourself there's another link to the wiki. And, well,

02:33.220 --> 02:39.460
GCC 15 is the newest release, which is now almost one year old, and it's the recommended

02:39.460 --> 02:45.180
one for OpenMP because a lot of missing features are finally in there, and it's relatively

02:45.180 --> 02:51.180
complete. GCC 16 maybe has fewer big changes, but a lot of changes here and there. One

02:51.180 --> 02:57.460
of the changes is, for instance, the MI300 support; we finally got access to one, thanks to

02:57.540 --> 03:06.820
the University of Oregon, so in some idle time this could be added. In the talk itself, I will

03:06.820 --> 03:12.820
talk about OpenMP interop, so some examples are coming there. OpenMP interop is supported

03:12.820 --> 03:19.780
since GCC 15. "Only", one may say, since it's a somewhat older feature, but it contains most

03:19.860 --> 03:25.620
of the OpenMP 6 features and in particular has Fortran support. And with OpenMP 6,

03:25.620 --> 03:31.940
I guess most, or several, of the other compilers will also support at least the C version

03:31.940 --> 03:39.940
of interop, if you take the latest version. On the GCC side, planned are of course

03:39.940 --> 03:46.980
the usual things like bug fixes, filling some gaps in either corner cases or missed items,

03:47.060 --> 03:55.220
and some completely missing 5.x features. The upcoming work, which has actually now

03:55.220 --> 03:59.860
started or is supposed to start soon, is the next two items: offload performance improvements and

03:59.860 --> 04:08.260
OMPT. But don't expect too much of this work too soon; I mean, some parts of the performance work

04:08.340 --> 04:13.300
of course go in all the time, but the bigger changes are expected to land in the next GCC version.

04:14.820 --> 04:20.740
And well, like always, it's open source, so contributions are welcome:

04:22.020 --> 04:28.580
code, documentation, bug reports. And I guess there will be Google Summer of Code this year,

04:28.580 --> 04:35.780
so if someone has an idea for a project: I think help is always welcome.

04:38.260 --> 04:46.500
Well, let's move to OpenMP interop. The starting point is this: one has existing code.

04:46.500 --> 04:53.140
In principle it does not really need to use OpenMP; interop still has some advantages, but it mostly makes sense

04:53.140 --> 04:58.580
if one has OpenMP code; whether host-side or offload does not really matter. And one has of course a GPU.

04:59.860 --> 05:07.220
Then there are these nice libraries by vendors like Nvidia, AMD and so on, which provide

05:07.380 --> 05:13.540
optimized versions running on the GPU for fast Fourier transforms, basic linear

05:13.540 --> 05:21.620
algebra and so on, and of course one wants to use them. Interop makes that a bit easier and

05:21.620 --> 05:29.140
provides some compatibility or portability, especially with the OpenMP part

05:29.140 --> 05:34.660
but also in general. In a later part, it becomes more important for dependencies and

05:34.660 --> 05:41.540
concurrency. So I've picked an example, picking CUDA and picking C, where there's the

05:41.540 --> 05:48.180
BLAS call in the middle. In this example, of course, one has to transfer data to the device,

05:48.180 --> 05:57.860
create this stream object and use it. Part of this one can replace with OpenMP.

05:58.660 --> 06:04.900
So first we handle the data part, which has nothing to do with the interop feature:

06:06.100 --> 06:11.380
we copy the data, or allocate the data on the device; in this example with omp target, one array is

06:11.380 --> 06:16.980
not initialized on the device. Then there are various ways to get the pointer on the device side;

06:16.980 --> 06:24.020
here I just chose omp_get_mapped_ptr to get it. Then one still has to create the

06:24.020 --> 06:30.740
stream and so on, which I skipped here, and use it. That's quite nice, but one wants to do more.
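
NOTE
Editor's sketch, not code shown in the talk: a minimal C illustration of letting
OpenMP handle device allocation and transfers and then handing device pointers to a
library-style routine. It uses use_device_ptr, one of the various ways to obtain the
device address, instead of the omp_get_mapped_ptr call the speaker chose. device_axpy
is a hypothetical stand-in for a vendor BLAS routine; all names are assumptions.
```c
#include <assert.h>
/* Hypothetical stand-in for a vendor routine that expects device pointers. */
void device_axpy(int n, float a, const float *x, float *y)
{
    /* is_device_ptr: x and y already are device addresses, so they are not mapped. */
    #pragma omp target teams distribute parallel for is_device_ptr(x, y)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
/* Host-side wrapper: OpenMP allocates device memory and copies the data. */
void axpy(int n, float a, float *x, float *y)
{
    assert(n >= 0);
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        /* In the next statement, x and y refer to the device copies. */
        #pragma omp target data use_device_ptr(x, y)
        device_axpy(n, a, x, y);
    }
}
```
Calling axpy(4, 2.0f, x, y) updates y in place. Without an offload device, or when
compiled without OpenMP, the same code simply runs on the host.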

06:31.620 --> 06:37.380
Of course OpenMP provides various additional ways of doing things, like allocating memory

06:37.380 --> 06:44.660
directly on the target, or allocators for managed memory, at least as

06:44.740 --> 06:50.900
an extension, or unified shared memory; then one can avoid some of these issues. But that's a side

06:50.900 --> 06:59.780
remark. Now let's go to interop. With interop, one has the directive omp interop with init

06:59.780 --> 07:06.260
and at the end destroy, where one has an object. Then one says what one wants to have:

07:06.260 --> 07:15.380
in this case target, which provides some basic information about the system. Then one can add

07:15.380 --> 07:21.620
what the preferred type is; if it's not available, the runtime might give you

07:21.620 --> 07:28.980
something different, because, let's say, an AMD GPU doesn't have CUDA. One can also specify the device and

07:28.980 --> 07:35.620
other things. Then one gets an object, if it's available, and one can check what type it is

07:35.700 --> 07:41.700
via the foreign-runtime ID, so the foreign runtime or back end. In this case it's CUDA, as expected.

07:41.700 --> 07:47.540
One can also get, for instance, the device number, but that's not really the most

07:47.540 --> 07:55.060
interesting part of it; we'll come to the more interesting part on the next slide. The GCC

07:55.060 --> 08:01.300
compiler supports, for Nvidia GPUs, CUDA, CUDA driver and HIP, and for AMD GPUs, HIP and HSA.

08:01.700 --> 08:11.700
HIP is essentially CUDA, except that "cu" or "cuda" in the API names is replaced by "hip".

08:11.700 --> 08:17.300
The interesting part, and the reason there is also HIP for the Nvidia one, is that the

08:17.300 --> 08:23.380
headers there essentially just refer to the CUDA version, so there's no overhead. Therefore,

08:23.380 --> 08:28.340
implementing it on the GCC side is also easy, because actually all three types return

08:28.340 --> 08:34.500
exactly the same, which is not the case for HIP and HSA. What's available, see the next

08:34.500 --> 08:41.780
slide, is also well defined, and of course it depends on the compiler, device and so on

08:41.780 --> 08:48.740
what's really available. Well, now comes the more interesting part: replacing this stream

08:48.820 --> 08:57.860
create and destroy. There one can see: if you use targetsync, what you get in this case is a stream

08:57.860 --> 09:05.860
object, and one can use the API routine as shown to get the stream out. Then

09:05.860 --> 09:11.620
at the end one destroys the object, which just waits until the stream has completed. So that already

09:11.700 --> 09:20.660
makes some things a bit easier and more useful. Let me quickly show that, if one is

09:20.660 --> 09:27.300
looking at the OpenMP documents, not the specification but the additional definitions, one sees what

09:27.300 --> 09:33.860
is available there: for CUDA the stream object, and for the other types, so for HIP a HIP stream,

09:33.860 --> 09:41.300
and some others. Of course, that's already a first step to make it a bit less library-dependent.

09:43.540 --> 09:51.780
Now the really fancy stuff, which you can't do manually, is if you combine it with OpenMP

09:52.820 --> 09:59.780
dependencies and asynchronous execution, so nowait. In this example, which

09:59.860 --> 10:07.620
comes from the OpenMP examples document, one initializes the data and has some outgoing dependencies.

10:07.620 --> 10:14.340
Then for the interop object, since it takes those values, one has the incoming dependency, and

10:15.860 --> 10:23.460
likewise, while the calculation is ongoing, one has the outgoing dependency, so the destroy

10:23.460 --> 10:28.500
essentially waits until the data is ready, and then one can use it later. So that's, I think,

10:29.300 --> 10:37.620
the most interesting feature of interop with depend. Slightly unrelated to this vector example,

10:37.620 --> 10:43.620
in the box I show that one can also use it in Fortran, as mentioned. I also wanted to

10:44.420 --> 10:51.620
add a HIP example. Especially on the Fortran side, the nice thing is that the HIP Fortran interface is

10:51.620 --> 10:58.020
available as Fortran files, so one can very easily compile it with whatever compiler one has

10:58.020 --> 11:05.460
and then use it. CUDA Fortran is a bit more difficult to get working with other compilers,

11:05.460 --> 11:13.220
but otherwise it works pretty much the same way. The other thing I want to talk about, to make things easier to use,

11:13.220 --> 11:21.940
is variant functions. First the normal declare variant way of doing it, not yet specialized for interop: one

11:21.940 --> 11:29.540
has a base function and a variant function which, in terms of the arguments and so on, is identical

11:29.540 --> 11:34.580
but of course has a different name. Then one can add declare variant, and under certain

11:34.580 --> 11:41.940
conditions the variant is used: those given in the match clause I marked here. If I just

11:41.940 --> 11:46.580
call the base function, the base function is called. But if, as in this case,

11:46.660 --> 11:51.540
the condition in the match clause is true, then the call actually doesn't

11:51.540 --> 11:57.060
call the base function but the variant function. There are several different ways

11:57.060 --> 12:04.500
to define when one or the other is used. OK, let's look at a device function again.
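
NOTE
Editor's sketch, not code shown in the talk: a minimal C example of a base function
with a declared variant. The names scale, scale_variant and run are hypothetical, and
the selector construct={parallel} is an assumption chosen for illustration; the slide
may have used a different match condition. Both functions return the same value, so
the result does not depend on whether OpenMP is enabled.
```c
/* Variant function: same arguments as the base function, different name. */
int scale_variant(int x) { return x + x; }
/* Base function: calls to scale() are redirected to scale_variant()
   whenever the context selector in the match clause applies. */
#pragma omp declare variant(scale_variant) match(construct={parallel})
int scale(int x) { return 2 * x; }
int run(int x)
{
    int outside = scale(x);   /* no enclosing parallel construct: base function */
    int inside = 0;
    #pragma omp parallel num_threads(1)
    inside = scale(x);        /* inside parallel: variant, if OpenMP is enabled */
    return outside + inside;
}
```
The call sites only ever name the base function; which body runs is decided by the
compiler from the context of each call.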

12:05.700 --> 12:13.220
This is the vector function we had before, from BLAS. One can create a device function which

12:13.300 --> 12:19.460
looks essentially the same; we hide the stream creation and destruction in there,

12:19.460 --> 12:23.700
and then we need to convert to device pointers, which assumes that the data is there.

12:25.540 --> 12:31.780
Well, then we get this one, but there are a lot of things or assumptions built in automatically,

12:31.780 --> 12:39.700
and it's much nicer to handle this differently. With declare variant there are two things. One is

12:39.700 --> 12:46.900
append_args: one can ask to add an interop object. So if we look at the variant function, we have

12:46.900 --> 12:52.260
now an interop object, from which we can directly get a stream object. The other thing we

12:52.260 --> 13:00.180
can do is adjust_args, where we say: take the device pointer of that argument and use it. That only works,

13:01.140 --> 13:05.380
in order to make it a bit more controlled, when the call is wrapped in a dispatch,

13:05.700 --> 13:13.380
so one only calls the base function, and directly in front of the call one needs to

13:13.380 --> 13:18.580
put omp dispatch, with some clauses; in my example I put device, but there are

13:18.580 --> 13:27.380
many more. So essentially that's it on the interop feature, which makes it easy to combine

13:27.460 --> 13:32.900
things, so somewhat optimized code with OpenMP code, or wrapping some things around.

13:35.780 --> 13:41.460
The other thing I want to talk about, and I see that I have much more time than I expected,

13:42.980 --> 13:48.900
is a sneak preview of another way to reduce the overhead with OpenMP,

13:49.780 --> 13:54.500
which also becomes interesting for compatibility with Kokkos, because they are moving that way themselves.

13:55.460 --> 14:01.460
OpenMP has the disadvantage that it can be slow, especially for offloading, as it carries,

14:02.740 --> 14:08.420
on the one hand, a lot of extra information like the internal control variables. Or, for instance,

14:08.420 --> 14:16.020
if you start a device region with parallel, the compiler then decides how many threads are

14:16.020 --> 14:20.420
to be used there. But if you then go into some function call, for instance, and in this

14:20.420 --> 14:27.540
parallel region you have another parallel, you might not really have any threads left, or

14:27.540 --> 14:32.980
the compiler has to reserve some. And you also have a lot of state information, where you

14:32.980 --> 14:41.380
can set some state that later gets used, like the number of threads and so on. Everything

14:41.380 --> 14:46.100
like that needs to be carried along, which makes things slow. The other thing is that a lot of

14:46.100 --> 14:53.300
things are built in, like heuristics for how many threads to use. So it works pretty well, but it

14:53.300 --> 15:01.140
can be quite slow. It depends on the compiler: some manage to do very well, some somewhat;

15:01.780 --> 15:07.700
GCC a bit less so, but it's also getting there. But it will never be as fast as

15:07.700 --> 15:14.340
CUDA and HIP, I mean as a language. So there are now ideas to support

15:14.340 --> 15:20.820
writing something with a different syntax that gets closer to that. The first step has already been made:

15:22.020 --> 15:31.620
so-called team-private variables, which in HIP and CUDA are known as __shared__,

15:31.620 --> 15:37.700
which can be used both for static variables and at kernel launch time, so from the kernel launch,

15:37.700 --> 15:44.420
dynamically, when the size is only known at run time. In OpenMP, there already exists in OpenMP

15:44.420 --> 15:53.060
6 the groupprivate directive, which is for static variables, and then also the dynamic groupprivate,

15:53.060 --> 16:01.540
which is not yet in a real release but in a technical report, so it should be in 6.1 by

16:01.540 --> 16:05.460
November. The other one, which is a bit more work in progress but hopefully also makes it

16:05.540 --> 16:10.740
by November, is the syntax for the number of blocks and threads, as specified in a kernel launch.

16:11.620 --> 16:17.460
The idea is the syntax on the right, where one specifies it essentially the same way:

16:17.460 --> 16:24.020
the number of teams and also the number of threads, where one has these three values, and then

16:24.020 --> 16:31.140
inside one has the team dimension and so on. With this dimension syntax one doesn't use the for loop;

16:31.300 --> 16:38.580
with the for loop, the number of teams would be calculated automatically, but otherwise

16:38.580 --> 16:45.460
one gets the index as shown there. The next part, which is probably further out because

16:45.460 --> 16:52.180
I haven't seen any work on it yet, would be to disable features which cause launch overhead.

16:52.180 --> 17:02.660
So that's what's coming up next. And, well, with that this short tour of all these topics is

17:02.660 --> 17:09.220
done. In summary: I gave a quick overview of the OpenMP and offload support;

17:09.860 --> 17:16.580
interop and also the CUDA-like features should make things more accessible

17:17.460 --> 17:22.740
for the high-performance parts where it's really hot code, like the CUDA or HIP part, and with

17:22.740 --> 17:28.340
interop, both get a bit easier, as does using the vendor libs. There's a somewhat longer version of

17:28.340 --> 17:34.180
both of them: the interop one is actually a longer talk of mine, and there was one by someone else

17:35.380 --> 17:41.700
at the Supercomputing BoF talks, available under this link. That's all from my side.

17:41.700 --> 17:48.900
I'll just give a small advertisement for an open-source embedded conference, Embedded Recipes, which my company runs

17:48.900 --> 17:53.940
in May, if someone is interested in the small systems and not only the large ones.

17:53.940 --> 18:01.300
But that's it from my side, and I'm open for questions.

18:07.860 --> 18:09.860
And we should have eight minutes for questions.

18:24.900 --> 18:32.820
The question was: OpenMP makes it easier to interact with the vendor libs,

18:32.820 --> 18:37.780
but does one still need to link them? Yes. I mean, the only thing you really save,

18:40.020 --> 18:46.580
compared with the initial version, is the CUDA stream create, CUDA stream destroy, and doing all the

18:46.580 --> 18:52.740
malloc, memcpy and so on. For the malloc and memcpy, of course,

18:55.060 --> 19:02.420
I mean, you already link those CUDA parts because of the runtime, so in principle you would link

19:02.420 --> 19:08.980
the same library already for the basic offload support. For the BLAS one, as I said, interop

19:08.980 --> 19:14.660
just replaces essentially the CUDA stream create and destroy, so you still need to link

19:14.740 --> 19:20.900
the BLAS library explicitly and have these cuBLAS create, cuBLAS set stream and so on

19:20.900 --> 19:29.300
as before. For the other one, as I said, it shouldn't really change because it's already there,

19:29.300 --> 19:36.820
but since some parts are not resolved at link time, you may still need to specify at link time

19:36.900 --> 19:58.980
where the library is, although the compiler's runtime looks it up at run time. So the question was

19:58.980 --> 20:06.660
whether one can specify some information such as, for instance, that the target region doesn't spawn

20:06.660 --> 20:15.300
more things, more threads and things like that. I think to a certain extent yes and no.

20:15.300 --> 20:21.540
There's nothing really well specified where one says "this does this and that" and it's really

20:21.620 --> 20:28.740
guaranteed and used by the compilers. But of course OpenMP, I don't know when it was, 5.1,

20:28.740 --> 20:38.740
5.2, maybe 5.1, added this assume and assumes, which goes a bit in that direction,

20:38.740 --> 20:48.420
and there one can of course say no_openmp or contains, so in principle one could

20:48.500 --> 20:55.300
wrap the inner block, saying no_parallelism, I think it's called. So there are

20:55.300 --> 21:04.820
some ways to do it, but I think it's not yet really fully utilized by the compilers, and

21:05.940 --> 21:11.300
it's to a certain extent also a bit vague what exactly one needs to put to make sure that

21:12.260 --> 21:19.940
the wanted optimization is done. But in principle one can annotate things. My feeling is that a lot

21:19.940 --> 21:24.660
of it is not much used. The holds expression is definitely used by compilers, where

21:24.660 --> 21:32.100
one says this expression is, let's say, always greater than zero, but for the others I think the

21:32.100 --> 21:53.140
support is much less. And with that, the stage goes to the last talk.
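
NOTE
Editor's sketch, not code shown in the talk: a minimal C example of the assume
directive (OpenMP 5.1) with the no_parallelism and holds clauses mentioned in the
answer. The function name sum_squares is hypothetical. The directive is only a promise
to the compiler; as the speaker says, how much compilers exploit it varies.
```c
#include <assert.h>
/* Sum of squares of v[0..n-1]. The block promises the compiler that it
   starts no OpenMP parallelism and that n is positive. */
long sum_squares(const int *v, int n)
{
    long s = 0;
    assert(n > 0);   /* check the promise made via holds() in debug builds */
    #pragma omp assume no_parallelism holds(n > 0)
    {
        for (int i = 0; i < n; i++)
            s += (long)v[i] * v[i];
    }
    return s;
}
```
If the holds expression is false at run time the behavior is undefined, which is why
the sketch checks it with an assert first.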


