WEBVTT

00:00.000 --> 00:12.000
Hello everyone, my name is Jade, I work for HPE and I'm going to be talking about productive

00:12.000 --> 00:15.840
parallel programming.

00:15.840 --> 00:21.680
So we're in a room about HPC, there's lots of parallel systems everywhere, but sometimes

00:21.680 --> 00:28.880
they are harder to program than say a shared memory system, and you have HPC experts and

00:28.880 --> 00:37.400
you have users who may not be HPC experts and to actually use these big systems is challenging.

00:37.400 --> 00:42.560
You know, everybody knows OpenMP; for shared memory systems, this is like the gold standard,

00:42.560 --> 00:49.800
it's great, but it's all compiler directive based, and it's just in C or C++.

00:49.800 --> 00:51.440
Or Fortran, yes.

00:51.440 --> 00:58.240
If you want distributed memory, you get MPI, SHMEM, pick your favorite technology, you want

00:58.320 --> 01:04.960
both, you use both, there's not one technology for both, and then we get to talk about

01:04.960 --> 01:11.360
everybody's favorite, GPU programming, and pick your favorite solution, there are many, many,

01:11.360 --> 01:17.440
many, many solutions, but the one big thing is all of those solutions are not distributed

01:17.440 --> 01:24.080
memory, so if you want GPUs and distributed memory and shared memory, you kind of

01:24.160 --> 01:29.520
hodgepodge everything together, and that's your solution, and so you have to be an expert

01:29.520 --> 01:32.160
in each of these technologies.

01:32.160 --> 01:39.200
So I'm not trying to say anything bad about these, they're all good, however, right, they're

01:39.200 --> 01:45.240
all based in C, C++, or Fortran, so you need to know C, C++, or Fortran.

01:45.240 --> 01:50.360
Some of these haven't changed in decades, I'm looking at you, MPI, you know, it's been around

01:50.440 --> 01:54.440
for a long time, but we're still doing things the same way, we were 30, 40 years ago,

01:54.440 --> 01:59.720
long before I was born, and, you know, you have to mix everything together, so you have to be

01:59.720 --> 02:05.080
an expert in everything, and if you want higher level abstractions out of C, C++, or Fortran,

02:05.080 --> 02:12.440
you usually have to give something up, and so I say that HPC has a high barrier to entry

02:12.440 --> 02:20.120
because of all of this. Can we fix that? I hope so. So Chapel is an alternative for

02:20.200 --> 02:26.040
productive parallel programming. It's a modern parallel programming language, portable and scalable,

02:26.040 --> 02:31.400
so I can run it on my laptop, I can run it on a supercomputer, open source, like all good projects,

02:32.280 --> 02:39.080
it's hosted under the HPSF, and the goal is to support general parallel programming and make it

02:39.080 --> 02:45.720
productive at scale. So if I have a GPU or a CPU, I should be able to target it from Chapel,

02:45.800 --> 02:53.320
and I should be able to do that at any scale. So you should imagine a language that kind of has

02:53.320 --> 02:58.120
all of your favorite characteristics: I can read it and write it, it's performant, it's fast,

02:58.840 --> 03:06.840
I can scale up on my systems, I can target my GPUs vendor-agnostically, and I can be portable,

03:08.040 --> 03:12.360
and hopefully it's fun; programming should be fun, and if it's not fun, you're doing something wrong.

03:12.440 --> 03:19.880
And all of this is what Chapel is for, so let's look at some lovely code,

03:21.720 --> 03:26.680
you don't really need to read this, all you need to know is this is a benchmark, it's STREAM

00:26.680 --> 00:34.760
Triad, it's just trying to drive the memory system, and there's a lot of code on this screen to

03:34.760 --> 03:39.240
basically do one little thing and one little for loop that's doing a multiplication.

03:40.200 --> 03:46.920
And so there's a lot of boilerplate and a lot of fluff just to do that. This is not productive.

03:47.960 --> 03:51.320
First, we pull up the Chapel code; this is the entire Chapel program.
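
The slide's Chapel version isn't captured in the transcript; a distributed Triad might be sketched roughly like this (illustrative only; the constant names and sizes are assumptions, not the slide's exact code):

```chapel
// Sketch of a distributed STREAM Triad in Chapel (names and sizes are assumptions)
use BlockDist;

config const n = 1_000_000;
const alpha = 0.01;

const Dom = blockDist.createDomain({1..n});  // block-distribute the indices across locales
var A, B, C: [Dom] real;                     // declare the arrays over that domain

B = 1.0;
C = 2.0;
A = B + alpha * C;                           // the whole computation: distributed and parallel
```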

03:52.680 --> 03:59.800
I can create my distributed array; that is, I create a distributed domain, and then I declare my array

03:59.800 --> 04:06.600
over that domain, and then I just write what I want. This is the entire program, you can compile it,

04:06.600 --> 04:14.200
it works. And so that's our productivity; what about the scalability? It does the exact same thing

04:14.200 --> 04:23.400
as MPI. So much less code for the same result. Another example, even more code for a slightly

04:23.400 --> 04:29.560
more complicated benchmark; this is the HPCC RandomAccess benchmark. If you're not familiar with

04:29.560 --> 04:34.440
the benchmark, you basically have a bunch of remote variables, and you are updating them all

04:35.080 --> 04:40.120
and committing them all the time. The nice thing about this code is right at the top of the program,

04:40.120 --> 04:43.880
it tells us exactly what this code is going to do, and it has this lovely little for loop,

04:43.880 --> 04:50.520
and then you write a whole bunch of C to do that. I don't know about you; I like C, but

04:50.520 --> 04:55.240
I don't want to write that. I'd rather write Chapel. This is not the whole Chapel program,

04:55.880 --> 05:01.080
but this is the core piece of the computation right here: one single loop. This does distributed

05:01.160 --> 05:08.760
computing, this does shared memory parallelism, all in one little loop. So that's pretty cool,

05:09.880 --> 05:17.080
even better. So up on this graph is better: Chapel completely outperforms MPI in this case. And

05:17.080 --> 05:23.640
part of the reason for this is aggregation. So in MPI, you know, to get that aggregation,

05:23.640 --> 05:29.400
it takes a little bit more work. In Chapel, you sort of get it for free. You do have to add a

05:29.480 --> 05:34.040
little bit extra, but you sort of get it for free, and so you get much better performance with Chapel.
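
For reference, the core RandomAccess update being described is on the order of one loop in Chapel; a sketch, assuming the table `T`, the `Updates` range, the `indexMask`, and the `RAStream` iterator are defined elsewhere, as in the benchmark code:

```chapel
// Sketch of the RA core loop: one forall drives both distributed and
// shared-memory parallelism; T, Updates, indexMask, RAStream are assumed
forall (_, r) in zip(Updates, RAStream()) do
  T[r & indexMask].xor(r);   // remote atomic XOR update
```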

05:35.720 --> 05:42.040
Yes, I know that seems unreal; it's real. How does this all actually happen?

05:43.480 --> 05:49.640
There's kind of two concerns in Chapel. One of them is the parallelism, you know, what you're

05:49.640 --> 05:57.480
actually computing with. So I can compute with a single CPU on a single node. I can compute

05:57.560 --> 06:03.240
with all of my CPUs on a single node, and we all love our GPUs, and I can drive the GPU.

06:03.240 --> 06:08.360
Right, so there's my parallelism, but I'm still only on one node, which brings in the other concept

06:08.360 --> 06:14.920
of locality: where do I actually put my tasks, and where is that data? So I can have one task on one

06:14.920 --> 06:21.080
node, I can have one task on all of my nodes, I can have all of my tasks on all of my nodes,

06:21.880 --> 06:27.240
and of course I can have my GPUs. And then same thing with the data, I can just have the data

06:27.320 --> 06:32.440
on one node, and every single node calls back to the first node to see the data.

06:33.800 --> 06:37.320
I don't want all that communication, so I can replicate the data across all of my nodes,

06:38.280 --> 06:44.200
or I can partition it, so I'm not wasting space on each node. Each node only has the piece that it needs;

06:44.200 --> 06:48.520
if it needs like boundary conditions, it can just talk to the other nodes and get that data back.

06:49.240 --> 06:52.520
And again, getting the theme here, I can do the same thing with GPUs.

06:53.240 --> 06:59.640
So I'm not expecting everyone to be Chapel experts at the end of this talk,

06:59.640 --> 07:03.880
but I am going to look at a little bit of the Chapel code here and kind of walk through how this

07:03.880 --> 07:10.280
execution is going to work. So all Chapel programs are going to start on locale 0. This is

07:10.280 --> 07:16.920
different than MPI where everybody starts up at once. And variables are stored wherever you are.

07:16.920 --> 07:24.120
So in this case, we're on locale 0 on our current task, so that's where the array is going to be allocated.

07:25.880 --> 07:30.360
I can then run a serial loop over all the locales, the nodes in my system,

07:31.240 --> 07:36.520
and I'm going to go through those nodes, and I'll move computation to the target locale.

07:36.520 --> 07:42.520
So in this case, we are moving to locale 0, where we already are, so nothing really happens.

07:42.520 --> 07:51.960
And then, because we're now on locale 0, when we allocate a new array, so var b = a,

07:52.920 --> 07:59.480
it's going to be duplicated on locale 0. And I can access it just like I would a regular variable.

07:59.480 --> 08:03.400
I don't have to do an MPI call, I don't have to do anything special, it's just there.

08:05.560 --> 08:11.160
Then I can kind of go through all my GPUs, and I can move execution to my GPU just like I would

08:11.160 --> 08:17.400
move to a compute node. The GPUs are treated as what's called a sublocale. So you have your top level

08:17.400 --> 08:25.880
locale, and that contains however many GPUs you might have. And again, the variable's in scope:

08:25.880 --> 08:32.840
I can access the variable and create a GPU array. So now b is my CPU array, c is my GPU array,

08:32.840 --> 08:37.960
and I have allocated the data. And I'll keep going through this program, just kind of walking through things.

08:38.680 --> 08:45.640
All right, I go and I move to the second iteration of the GPU loop. I move computation to my second

08:45.640 --> 08:52.360
GPU, and then I get my second GPU array. Notice that the GPU array disappears, the previous one,

08:52.360 --> 08:58.040
right, it's going out of scope, it's no longer used, it goes away. And we continue over to locale one,

08:58.040 --> 09:06.680
we move computation, we get our array, and we get our GPU array, and then our second GPU array.
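
The serial walkthrough above corresponds roughly to this shape of code (a sketch; the variable names are assumptions):

```chapel
// Sketch of the serial walkthrough: on-clauses move execution, scoping frees arrays
var a: [1..10] real;          // allocated on locale 0, where the program starts

for loc in Locales {          // serial loop over the locales (nodes)
  on loc {                    // move computation to the target locale
    var b = a;                // b is allocated on the locale we're now on
    for gpu in here.gpus {    // GPUs are sublocales of the current locale
      on gpu {
        var c = b;            // c is a GPU array; it goes away at end of scope
      }
    }
  }
}
```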

09:07.640 --> 09:14.280
So all of this is serial. This is distributed computation, but it is serial. So that's

09:14.280 --> 09:19.240
this is one part of the feature set. The other part of the feature set is adding in the parallelism.

09:19.240 --> 09:24.280
So this is mostly the same code, and I'll call out the differences here. So instead of using a

09:24.280 --> 09:31.160
for loop, so serial iteration, I have this coforall loop, which is a single parallel

09:31.240 --> 09:36.680
task per iteration. So every iteration of this loop, I'll get a new task. So if I have two

09:36.680 --> 09:44.920
locales, the first coforall gets two tasks. The second part of this is I have this cobegin statement.

09:44.920 --> 09:51.960
So for each child statement, I will get a parallel task. And so that's how I can drive both

09:51.960 --> 09:58.760
the CPUs and the GPUs at the same time. All of this results in everything being driven at once.

09:58.760 --> 10:04.680
I have my GPU arrays, my CPU arrays, everything's happening all at once distributed and parallel.
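
In sketch form, the parallel version swaps the serial loops for task-parallel constructs:

```chapel
// Sketch: coforall gives one task per iteration, cobegin one task per child statement
coforall loc in Locales do on loc {        // one task per locale, all running at once
  cobegin {
    { /* drive the CPUs here */ }
    coforall gpu in here.gpus do on gpu {  // one task per GPU
      /* drive this GPU here */
    }
  }
}
```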

10:06.440 --> 10:12.760
This is what I would call low-level Chapel. In Chapel, I want to be able to just write this

10:12.760 --> 10:18.760
declarative program. I'm not going to talk about this too much. Just know that we have this forall

10:18.760 --> 10:23.800
loop. It has this really fancy definition. All it means is the forall loop is magical, and it

10:23.800 --> 10:32.440
gives you good performance. It will do the right thing. You can write this simpler. This one's

10:32.440 --> 10:38.760
a little bit easier to explain. But the forall loop, it's magical. It does the right thing.

10:39.880 --> 10:43.960
It will give me single node parallel computation when I give it a single node array.

10:44.840 --> 10:48.680
It will give me distributed parallel computation when I give it a distributed array.

10:49.640 --> 10:56.280
And it will give me a GPU kernel when I give it a GPU array. All from one loop, all from one generic

10:56.280 --> 11:00.680
function, all of this happening at compile time, so without runtime overhead.
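
As a sketch, one generic procedure covers all three cases:

```chapel
// Sketch: the same forall adapts to wherever the array lives
proc double(ref A) {
  forall a in A do   // single-node parallel for a local array,
    a *= 2;          // distributed parallel for a distributed array,
}                    // a GPU kernel for a GPU array, all resolved at compile time
```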

11:02.920 --> 11:08.440
And we have this lovely loop, but maybe I don't want to rewrite all of my code into Chapel.

11:08.440 --> 11:13.160
I want to use interoperability. So instead of calling Chapel code, I can call an external procedure.

11:13.720 --> 11:18.440
It can be C, Fortran, whatever. But it doesn't even have to be external. If you really,

11:18.440 --> 11:22.920
really want to just write your C code in your Chapel program, you can; it's just kind of a nice little

11:22.920 --> 11:28.280
feature. And then again, you know, this is now working with distributed arrays, local arrays.
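
Calling out to C might be sketched like this (`scale` is a hypothetical C function, not one from the talk's slide, and `A` is assumed to be a possibly distributed array):

```chapel
// Sketch: declare an external C procedure and call it from a data-parallel loop;
// scale() is hypothetical and assumed to be implemented in C and linked in
extern proc scale(x: real, factor: real): real;

forall a in A do
  a = scale(a, 2.0);   // forall handles tasking and communication; C does the math
```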

11:28.280 --> 11:33.240
The forall loop takes care of all of the communication, all of the chunking up of tasks,

11:33.240 --> 11:39.080
and then you can just do your math in whatever language you want. So I've kind of shown some

11:39.160 --> 11:43.000
benchmarks. Anyone can write a good benchmark. Let's talk about real applications.

11:44.200 --> 11:49.080
I'm going to talk about two today. Well, two that are at different ends of this spectrum, and then

11:49.080 --> 11:55.640
we'll get into Arkouda, starting with the bigger one, which is CHAMPS. This is a 3D unstructured

11:55.640 --> 12:04.280
CFD framework, or computational fluid dynamics. It's very big. It was written by Professor Éric Laurendeau

12:04.280 --> 12:11.960
at Polytechnique Montréal. And the reason they want to use Chapel is because they can have

12:11.960 --> 12:16.680
students, undergraduate students, master students, instead of spending most of their time learning

12:16.680 --> 12:21.160
how to write good MPI C code, they just write Chapel code, and then they get good performance.

12:22.040 --> 12:26.280
They don't have that big learning curve, and so they can get right to doing science from the

12:26.280 --> 12:31.880
get go. The other thing is that because it's performant, they're still competitive with other

12:31.880 --> 12:37.720
centers and other solvers, so they're not giving something up by using this productive code.

12:39.720 --> 12:45.320
On the other end of this spectrum, there's this biodiversity kernel, measuring coral

12:45.320 --> 12:50.840
reef diversity. Basically, this is just doing a whole bunch of convolutions, stencil code,

12:50.840 --> 12:57.800
over some images. It's a much smaller Chapel application. It was written by Scott Bachman

12:57.800 --> 13:04.840
and associated authors. The reason why they went to Chapel is they were using MATLAB,

13:04.840 --> 13:11.400
and it was taking them months to do their computations. It was very complicated. It was rewritten in Chapel,

13:11.400 --> 13:15.720
and they immediately saw performance improvements because they weren't using MATLAB. The other

13:15.720 --> 13:22.200
thing is now that their computation was much more cleanly expressed in chapel, they could actually

13:22.200 --> 13:27.240
do algorithmic improvements that they couldn't even fathom in MATLAB before, just because the

13:27.240 --> 13:33.160
code was easier to read. They got algorithmic improvements on top of using Chapel. Then,

13:33.160 --> 13:40.520
because Chapel has GPU support, by adding a little bit of extra code, they could write GPU-enabled code,

13:40.520 --> 13:48.520
and they ran this on Frontier, going from four weeks of a desktop run to 20 minutes on

13:48.520 --> 13:55.480
a handful of nodes, not only doing science much faster, but at far bigger scales than they ever thought,

13:56.360 --> 14:02.680
and this was with a very small amount of Chapel. So, kind of two ends of this spectrum of

14:02.680 --> 14:09.080
productive applications. Again, why would you want to use this? Get out of the way of the science,

14:09.080 --> 14:14.440
like just express what you want to say and stop worrying about managing the memory, stop worrying

14:14.440 --> 14:19.720
about the communication. There's many languages that are good at this, but to get that performance,

14:19.720 --> 14:23.320
you might have to give up those nice abstractions; in Python, if you want good performance,

14:23.400 --> 14:31.560
you're already in C. It's easier to read, easier to maintain. You can get people started on your

14:31.560 --> 14:36.360
code earlier. You don't have to train them in a new technology. Well, you do have to train them in a new

14:36.360 --> 14:44.680
technology, but you don't have to train them in a difficult technology, I hope. All of this leads me to

14:44.680 --> 14:50.120
Arkouda. Sometimes it's hard to get people to get out of their Python and their old ecosystems.

14:51.080 --> 14:57.320
So, you want to reach people where they already are, and so you want Python at scale.

14:58.520 --> 15:03.640
So, this is the motivation for Arkouda. You have people who already know how to use Python,

15:03.640 --> 15:10.680
but you also have these big problems that you need to solve. So, Arkouda, there's kind of a lot of

15:10.680 --> 15:17.880
definitions for what Arkouda might be. Basically, you have a Python client, a Chapel server;

15:17.960 --> 15:23.240
they talk to each other. The Chapel server runs on all your nodes. The Python client is just

15:23.240 --> 15:30.360
NumPy or pandas. So, you might say this is scalable NumPy and pandas. You might say it's a framework

15:30.360 --> 15:35.320
for driving supercomputers. We would like to expand this beyond just NumPy and pandas to insert

15:35.320 --> 15:42.120
your favorite framework here. That's all lovely and good. Performance is more interesting.

15:42.520 --> 15:49.480
This is a comparison against kind of competing technologies for Arkouda, like Dask and Spark.

15:49.480 --> 15:57.960
Doing exploratory data analysis. This particular breakdown is of a workflow for telemetry analysis

15:57.960 --> 16:03.480
from Frontier. So, they have a lot of telemetry data that they want to do science on.

16:04.280 --> 16:09.640
And this is a breakdown of different steps in, say, a Jupyter notebook. And we've kind of

16:09.640 --> 16:14.920
arbitrarily drawn this line of interactivity of about two minutes. So, if you're waiting

16:14.920 --> 16:20.680
more than two minutes, your train of thought might have disappeared. So, you want interactivity,

16:20.680 --> 16:27.240
you want results back quickly. As you can see on this graph, lower is better. For the most part,

16:27.240 --> 16:32.600
Chapel does a much better job. Right up front, you can see that Chapel is much slower because

16:32.600 --> 16:37.320
Arkouda, Chapel Arkouda, is going to eagerly read in all of that data from

16:37.320 --> 16:44.760
Parquet files. But then, once it's read in all those data files, it's just significantly faster

16:44.760 --> 16:55.000
to do computation. Part of this is because Arkouda joins across nodes are much more efficient,

16:55.560 --> 17:01.160
because there's much better communication, because Arkouda is running on top of Chapel, a technology

17:01.240 --> 17:08.200
built for HPC, whereas Dask and Spark are not. The other thing that I will add to this

17:08.200 --> 17:13.160
is I did not personally run these results. The person who did knows a whole lot more about

17:13.160 --> 17:20.360
data science than I do. But they had a much easier time getting Chapel and Arkouda set up

17:20.360 --> 17:25.560
than they did trying to wrangle a Dask server and a Spark server. It was much easier to just launch

17:25.720 --> 17:33.960
the Arkouda server and do the job. All of this: Chapel is built from the ground up for productive

17:33.960 --> 17:40.760
parallel computing. The language features let you express your computation and it meets or beats

17:40.760 --> 17:46.600
the low level approaches while still being easy to read, understandable, maintainable,

17:48.200 --> 17:53.480
and without all of that extra boilerplate. The last thing I'll leave you with is this quote from

17:53.560 --> 17:58.520
one of the original authors of Arkouda. I was on the verge of resigning myself to learning

17:58.520 --> 18:02.760
MPI when I first encountered Chapel. After writing my first Chapel program, I knew I had found

18:02.760 --> 18:07.320
something much more appealing. Why would you want to write MPI when you could write Chapel?

18:10.040 --> 18:14.840
Chapel is open source. We would love to interact with you. If you saw something you liked in this

18:14.840 --> 18:19.000
presentation, please come tell us about it. If you saw something you didn't like, please come tell us

18:19.000 --> 18:25.560
about it. Come code with us, interact with us, help us make Chapel better for your particular use

18:25.560 --> 18:31.400
case. Thank you all very much. I'll take any questions now.

18:38.680 --> 18:42.360
So the question was what's the out of core computing capability? Can you define what you mean by out

18:42.360 --> 18:57.000
of core? So, out of core meaning: can you put arrays that don't fit in

18:57.000 --> 19:04.360
memory to disk? Chapel does not provide a built-in ability to do this, but you can just take your array

19:04.360 --> 19:10.040
dump it to IO and then read it back in. So there's no built-in way in Chapel to do this, but it's

19:10.680 --> 19:17.320
a Turing-complete language; you can do what you want, with parallel IO to speed up that computation.

19:32.600 --> 19:39.160
So the question was, is RDMA already implemented? The short answer: yes. The Chapel runtime

19:39.160 --> 19:47.000
interacts with the various networks and will do RDMA for faster memory access. Yes.

20:09.480 --> 20:18.600
So the question is, are we using the topology of a given run to do any kind of optimizations?

20:19.320 --> 20:28.760
Answer is sort of. For the most part, we just kind of assume a homogeneous allocation of nodes.

20:28.760 --> 20:35.080
So we're not doing any kind of dynamic changing of things. However, what you can do is say you have

20:35.960 --> 20:40.920
multiple NICs, multiple GPUs, and you want the locales to kind of split them up. You can use

20:40.920 --> 20:47.000
something called co-locales and have multiple locales in a given node so that you can kind of divide

20:47.000 --> 20:51.560
up those resources a little bit more, but everything is still kind of assumed to be very homogeneous.

20:54.680 --> 20:56.920
Any other questions? Yes.

20:57.880 --> 21:07.160
Does it run on the cloud? The question was, does it run on the cloud? Yes. I have done testing on AWS

21:07.960 --> 21:15.080
using ParallelCluster; pcluster works great. I know of some people who have used it with,

21:16.040 --> 21:22.040
I believe it was GCP. It worked great. I see no reason it wouldn't.

21:39.080 --> 21:43.240
The question was, can you control where memory is loaded, particularly on the GPU?

21:44.200 --> 21:49.160
I mean, you can control which GPU it is, but exactly where in the GPU?

21:51.160 --> 21:59.080
Not totally. The one thing you can do is interop with CUDA, HIP, whatever, and do whatever they can do.

22:01.240 --> 22:02.040
Yes.

22:13.240 --> 22:37.880
So the question was, when you do something in NumPy, the NumPy operations

22:37.880 --> 22:42.360
are vectorized, but if you try and not use those vectorized operations, it starts to get ugly.

22:42.360 --> 22:48.440
It's very similar in Arkouda. Don't try and iterate over an Arkouda array and pull all of your data

22:48.440 --> 22:51.720
from your remote node to your client. So, yes.

23:07.880 --> 23:29.080
So, the question is, when you're doing interoperability, does that included code get optimized by

23:29.080 --> 23:39.080
LLVM at the back end? Is that correct?

23:41.320 --> 23:47.720
A little bit, the question was, is it a DLL or is it optimized by a compiler?

23:48.600 --> 23:53.000
So, the example that I showed specifically with that external block, that's going to get optimized

23:53.000 --> 23:59.640
together. So, there's kind of no external calls. If you're calling out to a pre-compiled object

23:59.640 --> 24:03.880
or a shared library, yes, there's like a foreign function interface passing between the two.

24:13.720 --> 24:18.120
I'm not sure I can answer that well. We can talk after if you want.

24:18.120 --> 24:25.400
Any other questions? All right. Thank you all very much.

