WEBVTT

00:00.000 --> 00:11.000
So, as my day job, I work at Sciences Po Paris, which is a political science school.

00:11.000 --> 00:13.000
Nobody is doing eBPF at a political science school.

00:13.000 --> 00:15.000
I can tell you that.

00:15.000 --> 00:19.800
Most of this work I did in my previous job, where I was working at the national

00:19.800 --> 00:21.800
HPC center of France.

00:21.800 --> 00:25.200
So, just a first question: how many of you know what HPC is?

00:25.200 --> 00:27.200
High performance computing?

00:28.200 --> 00:30.200
Okay, that's not too bad.

00:30.200 --> 00:33.200
Thanks to AI, everybody is getting to know what HPC is.

00:33.200 --> 00:35.200
So, just to give some context for the other people,

00:35.200 --> 00:39.200
because my presentation is going to be a bit focused on HPC,

00:39.200 --> 00:44.200
I'll do a short introduction of what HPC, high performance computing, is.

00:44.200 --> 00:46.200
This is a sort of data center.

00:46.200 --> 00:50.200
The one thing that is different between regular data centers and HPC clusters

00:50.200 --> 00:52.200
is the interconnect network.

00:53.200 --> 00:56.200
HPC clusters use a niche network.

00:56.200 --> 00:59.200
We call it RDMA, remote direct memory access.

00:59.200 --> 01:05.200
That gives you microsecond latency, which is not possible with Ethernet.

01:05.200 --> 01:10.200
That's what makes HPC clusters different from regular data centers.

01:10.200 --> 01:17.200
And thanks to RDMA, all these crazy huge AI models can be trained on thousands

01:17.200 --> 01:20.200
of nodes, passing messages between thousands of nodes

01:20.200 --> 01:23.200
that talk to each other with small messages.

01:23.200 --> 01:29.200
And that's actually done using MPI, which is another very niche HPC thing.

01:29.200 --> 01:36.200
It's software that is able to launch your application on thousands of nodes

01:36.200 --> 01:41.200
that are able to talk to each other with very, very small messages.

01:42.200 --> 01:48.200
So yeah, just a little bit of the context as I was telling before,

01:48.200 --> 01:50.200
like AI is transforming HPC.

01:50.200 --> 01:51.200
Yeah, thanks to AI.

01:51.200 --> 01:55.200
People are hearing more and more about HPC, because five years back

01:55.200 --> 01:58.200
HPC was such a niche subject, not a lot of people knew it.

01:58.200 --> 02:00.200
That's changed now.

02:00.200 --> 02:04.200
And yeah, most of the traditional HPC tools are very focused on MPI,

02:04.200 --> 02:10.200
this tool that I was talking about, you know, which is used to spawn applications

02:10.200 --> 02:13.200
on thousands of nodes at the same time.

02:13.200 --> 02:19.200
They are very focused on HPC workloads and will not work for AI workloads out of the box,

02:19.200 --> 02:23.200
because most of the AI workloads are packaged as Python code.

02:23.200 --> 02:27.200
Of course, the GPUs are the beast.

02:27.200 --> 02:29.200
They can compute very, very fast.

02:29.200 --> 02:33.200
Most of the time they're limited by the I/O; you have to feed the beast.

02:33.200 --> 02:36.200
The feeding of the beast is done by the storage.

02:36.200 --> 02:41.200
And if you don't have fast enough storage, it doesn't matter if you have H100s or B100s;

02:41.200 --> 02:45.200
you're never going to use their full potential.

02:45.200 --> 02:50.200
And again, one more thing about HPC systems is the parallel file system,

02:50.200 --> 02:54.200
or parallel distributed file system, as they are really called.

02:54.200 --> 02:59.200
So for example, if you're aware of AWS FSx, it's based on Lustre,

02:59.200 --> 03:01.200
which is a parallel file system.

03:01.200 --> 03:04.200
And it can offer an unprecedented level of performance, again

03:04.200 --> 03:09.200
thanks to RDMA, that niche network I was talking about, remote direct memory access.

03:09.200 --> 03:12.200
And we have different types of parallel file systems

03:12.200 --> 03:17.200
at the moment; for example, Lustre is one of the biggest ones, a very well known one.

03:17.200 --> 03:19.200
Most of the HPC clusters use it.

03:19.200 --> 03:22.200
And also we have Spectrum Scale from IBM, and BeeGFS.

03:22.200 --> 03:24.200
We have CephFS as well,

03:24.200 --> 03:28.200
which is not very high performance, but it is pretty widely used.

03:28.200 --> 03:31.200
So every file system brings its own telemetry.

03:31.200 --> 03:38.200
It's almost impossible to find common ground to get common metrics across all these different file systems.

03:38.200 --> 03:40.200
And this is a huge problem.

03:40.200 --> 03:43.200
One of the biggest problems in the HPC community is always the fragmentation.

03:43.200 --> 03:47.200
Everybody does their own thing, there's no standardization.

03:47.200 --> 03:53.200
And yeah, most of the standard HPC tools, like Darshan,

03:53.200 --> 03:58.200
an I/O monitoring tool, are again made for HPC MPI.

03:59.200 --> 04:07.200
They are strongly coupled to MPI workloads only, and don't work for all sorts of applications that use HPC platforms.

04:07.200 --> 04:15.200
So yeah, these are a few of the problems that we needed to deal with when we thought about profiling the I/O.

04:15.200 --> 04:19.200
So, what would we ideally like to have, right?

04:19.200 --> 04:23.200
I mean, we want to monitor the applications,

04:24.200 --> 04:27.200
and the file system that we're using.

04:27.200 --> 04:32.200
Portability is very important, and standardization is quite important in HPC as well,

04:32.200 --> 04:34.200
which doesn't exist at the moment.

04:34.200 --> 04:38.200
And let's be honest, nobody wants to change their code; people are lazy.

04:38.200 --> 04:43.200
That's the reality, we have to accept it. Asking people to recompile their code,

04:43.200 --> 04:46.200
to instrument their code, is never going to work.

04:46.200 --> 04:49.200
It is an option, but it's never going to work in the real world.

04:50.200 --> 04:53.200
And of course, it has to be minimally intrusive, with negligible overhead,

04:53.200 --> 04:58.200
because at the end of the day we want to debug performance issues in production,

04:58.200 --> 04:59.200
not on our laptop.

04:59.200 --> 05:04.200
It doesn't make much sense; it doesn't give you a lot of interesting information,

05:04.200 --> 05:06.200
just doing I/O profiling on your laptop.

05:06.200 --> 05:10.200
So we have to do it in production, and so it has to be minimally intrusive.

05:10.200 --> 05:13.200
It should not have a lot of overhead.

05:14.200 --> 05:17.200
And yeah, if you can capture the I/O at any level of the stack,

05:17.200 --> 05:19.200
That's even better.

05:19.200 --> 05:22.200
And that's what eBPF can offer.

05:22.200 --> 05:25.200
I think I don't really have to introduce it here, because

05:25.200 --> 05:29.200
this is an eBPF devroom; I think everybody here knows what eBPF is.

05:29.200 --> 05:34.200
So yeah, how do we use eBPF to do this I/O monitoring?

05:34.200 --> 05:40.200
Just by tracing the VFS layer of the file system.

05:40.200 --> 05:44.200
So just to give you an idea how the I/O happens in Linux:

05:44.200 --> 05:48.200
whenever a process does I/O,

05:48.200 --> 05:51.200
either read or write syscalls,

05:51.200 --> 05:54.200
they actually pass through the VFS layer, the virtual file system layer,

05:54.200 --> 05:58.200
which is the interface that every file system has to implement

05:58.200 --> 06:01.200
to be able to actually talk to the hardware.

06:01.200 --> 06:04.200
So for example, be it a local file system or a remote file system,

06:04.200 --> 06:08.200
like Lustre, BeeGFS, Spectrum Scale, Btrfs,

06:08.200 --> 06:13.200
they all have to implement the interfaces defined by the VFS.

06:13.200 --> 06:17.200
And it's the VFS's responsibility, looking at the file

06:17.200 --> 06:21.200
descriptor of the file that wants to do the I/O, to delegate the call

06:21.200 --> 06:27.200
to the correct driver that actually does the real I/O.
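
As a toy illustration of that dispatch idea (a minimal Python sketch, not kernel code; all class and mount names here are invented):

```python
# Toy model of VFS dispatch: every "file system" implements the same
# interface, and the VFS routes each call to the right driver based on
# where the file lives. All names are invented for illustration.

class FileSystem:
    def read(self, path):
        raise NotImplementedError

class ExtLike(FileSystem):
    def read(self, path):
        return f"ext4 driver read {path}"

class LustreLike(FileSystem):
    def read(self, path):
        return f"lustre driver read {path}"

# Mount table: the longest matching prefix decides which driver handles the I/O.
MOUNTS = {"/": ExtLike(), "/scratch": LustreLike()}

def vfs_read(path):
    mount = max((m for m in MOUNTS if path.startswith(m)), key=len)
    return MOUNTS[mount].read(path)

print(vfs_read("/scratch/data.bin"))  # routed to the Lustre-like driver
print(vfs_read("/etc/hosts"))         # routed to the ext4-like driver
```

Because the dispatch funnels through one layer, hooking that single layer sees every file system at once, which is the whole point made above.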

06:27.200 --> 06:34.200
So if we can actually monitor the VFS layer of the kernel

06:34.200 --> 06:37.200
by injecting eBPF programs,

06:37.200 --> 06:42.200
we can actually monitor all the I/O that's happening through the kernel, right?

06:42.200 --> 06:46.200
So for example, here is a bunch of the functions that

06:46.200 --> 06:50.200
we can use to monitor the I/O: vfs_read

06:50.200 --> 06:53.200
to read a file, vfs_write to write a file, open, create,

06:53.200 --> 06:55.200
mkdir, and remove as well.

06:55.200 --> 06:57.200
And based on the version of the kernel,

06:57.200 --> 07:00.200
there are some variants of the read and write that we need

07:01.200 --> 07:04.200
to instrument too, but

07:04.200 --> 07:08.200
globally these are the most important functions: by

07:08.200 --> 07:13.200
instrumenting these functions, we can capture all the I/O activity

07:13.200 --> 07:17.200
on a given node, from the kernel's point of view.

07:17.200 --> 07:21.200
And this is pretty much the basic idea of what we're doing.

07:21.200 --> 07:24.200
So just to show you some sample code;

07:24.200 --> 07:28.200
the headers and stuff, you know, are pretty standard.

07:28.200 --> 07:32.200
So one thing I would like to talk about here is,

07:32.200 --> 07:35.200
you know, the resource managers.

07:35.200 --> 07:38.200
It doesn't matter which resource manager you're using,

07:38.200 --> 07:41.200
for example an HPC batch system like Slurm,

07:41.200 --> 07:46.200
or cloud VMs, like OpenStack, or even Kubernetes:

07:46.200 --> 07:50.200
to be able to manage the resources, they use kernel cgroups.

07:50.200 --> 07:53.200
So each cgroup means a pod in

07:53.200 --> 07:57.200
the Kubernetes context, or a virtual machine in the OpenStack context,

07:57.200 --> 08:03.200
or a batch job in the Slurm HPC batch system context.

08:03.200 --> 08:08.200
So, by monitoring the cgroups and the mount point,

08:08.200 --> 08:12.200
because most of the time the file systems are mounted

08:12.200 --> 08:17.200
on given paths on the Linux server,

08:17.200 --> 08:22.200
by keeping track of the combination of the cgroup

08:22.200 --> 08:25.200
and the mount point, we should be able to track

08:25.200 --> 08:29.200
the I/O activity of a given application, of a given workload,

08:29.200 --> 08:32.200
for example for a given pod in the Kubernetes context,

08:32.200 --> 08:35.200
let's say, onto a given file system.

08:35.200 --> 08:39.200
So that's what we're going to track as the VFS event key,

08:39.200 --> 08:41.200
which you can see here.

08:41.200 --> 08:45.200
And then the event itself is the number of bytes that

08:45.200 --> 08:49.200
the pod is actually writing to the given file system,

08:49.200 --> 08:52.200
the number of calls, and the number of errors.

08:52.200 --> 08:55.200
These are the structs we define, and we put them into an

08:55.200 --> 09:00.200
eBPF map, with this event key as the map key

09:00.200 --> 09:07.200
and the event as the value of the map.
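
In rough terms the map bookkeeping behaves like this (a minimal Python sketch of the logic, not the actual eBPF structs; field and function names are invented):

```python
# Sketch of the aggregation the eBPF program does: a hash map keyed by
# (cgroup id, mount point), whose value accumulates calls, bytes and
# errors per workload per file system.
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "bytes": 0, "errors": 0})

def on_vfs_write(cgroup_id, mount, ret):
    """ret mimics the vfs_write return value: bytes written, or <0 on error."""
    ev = stats[(cgroup_id, mount)]
    ev["calls"] += 1
    if ret < 0:
        ev["errors"] += 1
    else:
        ev["bytes"] += ret

on_vfs_write(1234, "/scratch", 4096)
on_vfs_write(1234, "/scratch", 4096)
on_vfs_write(1234, "/scratch", -5)   # a failed write
print(stats[(1234, "/scratch")])      # {'calls': 3, 'bytes': 8192, 'errors': 1}
```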

09:07.200 --> 09:12.200
And here is where most of the heavy work is done, basically.

09:12.200 --> 09:16.200
We take vfs_write, for example;

09:16.200 --> 09:21.200
it has a file pointer, which we take as the input of the function.

09:21.200 --> 09:26.200
We get the number of bytes that the process is actually attempting to write to the file system.

09:26.200 --> 09:29.200
This part is pretty easy to get.

09:29.200 --> 09:32.200
Most of the heavy work is done in this part, actually.

09:32.200 --> 09:37.200
So the eBPF helper functions give you the cgroup ID for v2 out of the box,

09:37.200 --> 09:41.200
which is pretty easy, but for v1 there are no helper functions.

09:41.200 --> 09:45.200
So I've written another function that can work

09:45.200 --> 09:50.200
for both v2 and v1 to get the cgroup ID.
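
The v1/v2 difference shows up in the /proc cgroup format; a minimal sketch of handling both (pure string parsing, and the Slurm-style paths below are made-up examples, not real output):

```python
# With cgroup v2 there is a single "0::<path>" line per process; with v1
# there is one line per controller ("<id>:<controllers>:<path>"). This
# parses /proc/<pid>/cgroup-style text and returns one usable path.

def cgroup_path(proc_cgroup_text):
    for line in proc_cgroup_text.strip().splitlines():
        hier_id, controllers, path = line.split(":", 2)
        if hier_id == "0":            # cgroup v2 unified hierarchy
            return path
        if "memory" in controllers:   # v1: pick one controller's hierarchy
            return path
    return None

v2 = "0::/system.slice/slurmstepd.scope/job_4242"
v1 = "12:memory:/slurm/uid_1000/job_4242\n3:cpuacct:/slurm/uid_1000/job_4242"
print(cgroup_path(v2))  # /system.slice/slurmstepd.scope/job_4242
print(cgroup_path(v1))  # /slurm/uid_1000/job_4242
```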

09:50.200 --> 09:54.200
And in this line here, what we're doing is get_mount_path.

09:54.200 --> 09:58.200
Looking at the file struct, we have to get the mount point,

09:58.200 --> 10:02.200
and this is actually the most tricky part; it took a while for me to get it right.

10:02.200 --> 10:09.200
Because within the file struct, we have to recursively go back up the path

10:09.200 --> 10:14.200
to actually be able to get the mount path of the file.

10:14.200 --> 10:17.200
And because there's very little support for loops

10:17.200 --> 10:21.200
in eBPF, and this is a recursive loop, I had to be careful.

10:21.200 --> 10:25.200
It took a while for me to get everything right, but it's been working well

10:25.200 --> 10:27.200
since I got it right.
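
The idea of that bounded walk can be sketched like this (a toy model; the Dentry class stands in for the kernel struct, and MAX_DEPTH mirrors the fixed loop bound the eBPF verifier forces instead of recursion):

```python
# Reconstructing a path by walking dentry parents. In eBPF you cannot
# recurse and loops must be bounded, so the walk uses a fixed iteration
# cap; paths deeper than the cap would simply be truncated.

class Dentry:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

MAX_DEPTH = 16  # fixed bound required to keep the verifier happy

def mount_path(dentry):
    parts = []
    for _ in range(MAX_DEPTH):        # bounded loop, no recursion
        if dentry is None or dentry.name == "/":
            break
        parts.append(dentry.name)
        dentry = dentry.parent
    return "/" + "/".join(reversed(parts))

root = Dentry("/")
scratch = Dentry("scratch", root)
user = Dentry("user42", scratch)
print(mount_path(Dentry("out.dat", user)))  # /scratch/user42/out.dat
```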

10:27.200 --> 10:32.200
And then the last part is pretty simple, very standard: you look up an event;

10:32.200 --> 10:36.200
if there is no event, you create a new event, and if there is already an event,

10:36.200 --> 10:41.200
you just increase the calls and the bytes, and the errors as well.

10:41.200 --> 10:42.200
I'm missing that here.

10:43.200 --> 10:48.200
It's pretty simple, but as I said, this middle part is most of the heavy lifting,

10:48.200 --> 10:54.200
and this is where most of the computation time is spent.

10:54.200 --> 11:04.200
So, to be able to test the approach, I did a test using an NFS server.

11:04.200 --> 11:08.200
I took a dedicated server to do the test,

11:08.200 --> 11:10.200
so that nobody else was using it, only me.

11:11.200 --> 11:17.200
I created the NFS share and mounted it onto the same server, to remove any sort of network noise.

11:17.200 --> 11:24.200
And I used the IOR benchmark; IOR is heavily used in the HPC environment,

11:24.200 --> 11:33.200
and on any sort of platform, I guess, to test the I/O performance of the underlying file system.

11:34.200 --> 11:44.200
So yeah, on this NFS server I tested with different transfer sizes of 1, 2, 4, and 16 megabytes.

11:44.200 --> 11:52.200
I don't remember exactly how much the entire data was, something like 10 or 20 gigabytes of data, with these transfer sizes.

11:52.200 --> 11:57.200
For both POSIX and MPI-IO. Okay, I'm not going to go deep into MPI-IO;

11:57.200 --> 12:03.200
MPI-IO is a convenience wrapper for MPI applications that enables

12:03.200 --> 12:07.200
multiple processes from different nodes to write to the same file

12:07.200 --> 12:11.200
at a given offset. The MPI people love to use MPI-IO,

12:11.200 --> 12:18.200
because it's a convenience wrapper for different processes to write to the same file.

12:18.200 --> 12:23.200
And then, most of these eBPF programs are implemented in

12:23.200 --> 12:27.200
in, in a primitive, export, or, like, King's, export, or, actually, this is the work we've been doing,

12:27.200 --> 12:34.200
in the HPC center back then. And this exporter uses the eBPF programs to monitor the I/O.

12:34.200 --> 12:40.200
And for the test, I configured Prometheus to scrape every

12:40.200 --> 12:46.200
2 seconds, intentionally, to create a heavy enough load, so that every time the map has to be read,

12:46.200 --> 12:51.200
the kernel has to take a spin lock, which is going to create an overhead.

12:51.200 --> 12:56.200
So I've chosen such a small scrape interval just to create the overhead artificially,

12:56.200 --> 13:01.200
to see if it's going to have any sort of influence on the performance.

13:01.200 --> 13:07.200
So, I'm looking at two metrics. The relative overhead, which is basically

13:07.200 --> 13:12.200
the bandwidth reported by IOR without the exporter relative to the one with the exporter.

13:12.200 --> 13:17.200
So, higher values mean

13:17.200 --> 13:22.200
higher overhead.

13:22.200 --> 13:26.200
And then the relative error, where close to one means a small error: it is

13:26.200 --> 13:31.200
basically the bandwidth reported by the IOR benchmark relative to the bandwidth reported by

13:31.200 --> 13:38.200
the exporter through Prometheus. And here are the results: the first column

13:38.200 --> 13:41.200
is the POSIX one, the second column is the MPI-IO one.

13:41.200 --> 13:44.200
And the top rows are the relative overhead and the bottom ones are the relative

13:44.200 --> 13:50.200
error. So, for both read and write operations,

13:50.200 --> 13:54.200
for the smallest transfer size of one megabyte, I guess,

13:54.200 --> 13:58.200
yeah, you can see that actually the mean value is less than one,

13:58.200 --> 14:01.200
which doesn't make any sense, because that would mean that with the exporter

14:01.200 --> 14:04.200
the performance is actually higher than without the exporter.

14:04.200 --> 14:09.200
So that tells me that we are in the floating-point error range,

14:09.200 --> 14:15.200
so the overhead is not really measurable, let's say.

14:15.200 --> 14:19.200
But if you look, all the values are hovering around one,

14:19.200 --> 14:25.200
which shows that the overhead is pretty much negligible or almost non-existent.

14:25.200 --> 14:30.200
And the same with the relative error: all the values are pretty much hovering around one

14:30.200 --> 14:35.200
as well, for most of the cases, which tells me, okay,

14:35.200 --> 14:39.200
well, there might be some overhead, but it's not really measurable, or not really

14:39.200 --> 14:43.200
noticeable.
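
The two metrics can be sketched as simple ratios (the numbers below are illustrative, not the measured results):

```python
# Relative overhead: IOR bandwidth without the exporter divided by the
# bandwidth with it; values near 1.0 mean no measurable overhead.
# Relative error: IOR's reported bandwidth divided by the bandwidth the
# exporter reported through Prometheus; near 1.0 means good agreement.

def relative_overhead(bw_without_exporter, bw_with_exporter):
    return bw_without_exporter / bw_with_exporter

def relative_error(bw_ior, bw_exporter):
    return bw_ior / bw_exporter

# e.g. 10.2 GB/s without the exporter vs 10.1 GB/s with it (made-up numbers):
print(relative_overhead(10.2, 10.1))  # close to 1.0 -> negligible overhead
print(relative_error(10.1, 10.0))     # close to 1.0 -> exporter agrees with IOR
```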

14:43.200 --> 14:47.200
And yeah, folks from HPC like to say, ah, everything works on a single node

14:47.200 --> 14:51.200
because it's such an easy configuration; we want to see if things

14:51.200 --> 14:55.200
work in the multi-node context, right? I mean, I don't really see why

14:55.200 --> 14:57.200
it should not work in the multi-node context, because

14:57.200 --> 15:00.200
it doesn't really have anything to do with single node or multi-node, but

15:00.200 --> 15:03.200
anyway, I did the test on multiple nodes with

15:03.200 --> 15:09.200
IOR on Jean Zay, our HPC platform, which has around 2,500 nodes.

15:09.200 --> 15:13.200
So I did a test on four nodes, with 16 MPI processes: four nodes,

15:13.200 --> 15:16.200
four processes on each node.

15:16.200 --> 15:20.200
With a transfer size of one megabyte, which is, you know,

15:20.200 --> 15:25.200
let's say, the default the cluster has been using on the Lustre file system,

15:25.200 --> 15:27.200
with POSIX and MPI-IO.

15:27.200 --> 15:31.200
And Darshan, which I've been talking about: Darshan is another standard

15:31.200 --> 15:35.200
I/O monitoring tool that the MPI folks use,

15:35.200 --> 15:39.200
so I used Darshan to be able to validate my results from the exporter.

15:39.200 --> 15:43.200
And Darshan shows the metrics with an interval of, well, three

15:43.200 --> 15:45.200
or two seconds or something. This is something I couldn't change;

15:45.200 --> 15:49.200
it is something hard-coded within Darshan.

15:49.200 --> 15:53.200
And again, we used the exporter, the CEEMS exporter, that's been

15:53.200 --> 15:56.200
running on the

15:56.200 --> 16:00.200
HPC platform, with a scrape interval of 10 seconds here.

16:00.200 --> 16:03.200
I could not reduce this scrape interval,

16:03.200 --> 16:06.200
because it's a production system, which has been running for a while,

16:06.200 --> 16:11.200
so changing that to something lower is complicated,

16:11.200 --> 16:14.200
because, as I said, we have 2,500 nodes,

16:14.200 --> 16:18.200
and it's not that easy to

16:18.200 --> 16:22.200
take the nodes out of production for a test.

16:22.200 --> 16:27.200
So, what we did to validate the exporter is we just

16:28.200 --> 16:33.200
compared the instantaneous bandwidth reported by Prometheus

16:33.200 --> 16:35.200
with the one reported by Darshan, which is the

16:35.200 --> 16:37.200
more standard tool.
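
The instantaneous bandwidth comes from the exporter's cumulative byte counters roughly the way PromQL's rate() derives it: the difference between two scrapes divided by the time between them (the numbers here are made up):

```python
# Derive an instantaneous bandwidth from two scrapes of a cumulative
# byte counter, the basic idea behind PromQL's rate() over a counter.

def rate(sample_old, sample_new):
    """Each sample is (unix_time_seconds, counter_value_bytes)."""
    (t0, v0), (t1, v1) = sample_old, sample_new
    return (v1 - v0) / (t1 - t0)    # bytes per second

# Two scrapes 10 s apart (the production scrape interval mentioned above):
bw = rate((1000.0, 5_000_000_000), (1010.0, 25_000_000_000))
print(bw / 1e9, "GB/s")  # 2.0 GB/s over that window
```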

16:38.200 --> 16:41.200
So, what we see here, again, for POSIX and MPI-IO:

16:41.200 --> 16:46.200
the solid lines are from the eBPF approach and the

16:46.200 --> 16:50.200
markers are from Darshan, the other software. From what we can see,

16:50.200 --> 16:52.200
they are pretty close to each other for both

16:52.200 --> 16:55.200
read and write operations.

16:55.200 --> 16:57.200
You can see that one of the nodes

16:57.200 --> 16:59.200
actually had degraded I/O performance;

16:59.200 --> 17:01.200
it's a production machine, things can go wrong,

17:01.200 --> 17:04.200
but both Darshan and

17:04.200 --> 17:08.200
the exporter based on eBPF

17:08.200 --> 17:11.200
followed the trend pretty well,

17:11.200 --> 17:14.200
for both POSIX and MPI-IO.

17:14.200 --> 17:17.200
So that actually shows that

17:17.200 --> 17:20.200
the exporter is working well, and that the approach we're using,

17:20.200 --> 17:23.200
eBPF, is working as it should.

17:24.200 --> 17:26.200
Just to show you, for example, yeah, as I said,

17:26.200 --> 17:29.200
the exporter has been running on the HPC platform

17:29.200 --> 17:31.200
for more than two years now;

17:31.200 --> 17:33.200
as I said, it has 2,500 nodes,

17:33.200 --> 17:37.200
and what we see here is the I/O bandwidth from

17:37.200 --> 17:42.200
the whole cluster, onto the different file systems.

17:42.200 --> 17:44.200
We have two different file systems

17:44.200 --> 17:46.200
that we're actually monitoring: one is scratch,

17:46.200 --> 17:48.200
which is the high-performing file system,

17:48.200 --> 17:51.200
and one is work, which is a lower-performing file system,

17:52.200 --> 17:54.200
from different partitions;

17:54.200 --> 17:57.200
the HPC cluster has different groups of nodes.

17:57.200 --> 18:00.200
So, from each group of nodes, what we see is

18:00.200 --> 18:03.200
the entire I/O bandwidth.

18:03.200 --> 18:06.200
This is the cluster-side view,

18:06.200 --> 18:08.200
the whole cluster view, and, like, for example,

18:08.200 --> 18:11.200
we provide these metrics for the individual users

18:11.200 --> 18:13.200
as well, for the jobs.

18:13.200 --> 18:16.200
So, for example, users can get this

18:16.200 --> 18:20.200
for the I/O operations,

18:20.200 --> 18:22.200
for both read and write bandwidth and requests,

18:22.200 --> 18:24.200
for, like, given job, for example,

18:24.200 --> 18:26.200
from the Grafana dashboards in real time.

18:26.200 --> 18:29.200
Well, back then we had users

18:29.200 --> 18:32.200
who were doing pretty

18:32.200 --> 18:35.200
intensive I/O operations, running codes

18:35.200 --> 18:37.200
that do a lot of I/O operations.

18:37.200 --> 18:39.200
So, after we deployed the exporter,

18:39.200 --> 18:41.200
nobody actually came back to us saying,

18:41.200 --> 18:43.200
okay, my I/O performance got degraded

18:43.200 --> 18:45.200
because of the

18:45.200 --> 18:47.200
exporter. So I think,

18:48.200 --> 18:51.200
we never had any complaints.

18:51.200 --> 18:53.200
So, the exporter is working very well,

18:53.200 --> 18:57.200
without having any noticeable performance difference.

18:57.200 --> 19:01.200
So, what more can we do with eBPF, right?

19:01.200 --> 19:04.200
So, yeah, again, like, I'm still talking about the MPI,

19:04.200 --> 19:06.200
because, like, this is the one that's used,

19:06.200 --> 19:09.200
heavily in HPC, and now in AI as well,

19:09.200 --> 19:13.200
and it's very important to be able to profile the MPI functions.

19:13.200 --> 19:16.200
We can actually trace the MPI libraries using, like,

19:16.200 --> 19:19.200
eBPF as well, you know, by sending the

19:19.200 --> 19:21.200
the events to the ring buffer.

19:21.200 --> 19:25.200
And the importance of the ring buffer

19:25.200 --> 19:27.200
is, you know, that it keeps track of,

19:27.200 --> 19:29.200
I mean, it keeps the order, even of the

19:29.200 --> 19:31.200
events coming from multiple CPUs,

19:31.200 --> 19:32.200
and in HPC, of course,

19:32.200 --> 19:35.200
your job is using multiple CPUs at a given time,

19:35.200 --> 19:37.200
and, like, you know, events can be generated from any of the CPUs

19:37.200 --> 19:39.200
that your job is running on at a given time,

19:39.200 --> 19:42.200
and it's very important to keep the order of the events.
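
A toy illustration of that ordering problem (with plain per-CPU buffers you would have to merge the streams by timestamp yourself; the BPF ring buffer hands events over already in submission order):

```python
# Events as two CPUs might produce them, each stream in-order locally
# (timestamps and function names are invented for illustration).
import heapq

cpu0 = [(100, "MPI_File_open"), (400, "MPI_File_close")]
cpu1 = [(200, "MPI_File_write"), (300, "MPI_File_write")]

# Merge the per-CPU streams into one globally ordered stream by timestamp,
# which is the work a single shared ring buffer saves you.
merged = list(heapq.merge(cpu0, cpu1))
print([name for _, name in merged])
# ['MPI_File_open', 'MPI_File_write', 'MPI_File_write', 'MPI_File_close']
```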

19:42.200 --> 19:46.200
And if you can get these things sent to the OpenTelemetry

19:46.200 --> 19:49.200
stack, you can generate profile traces;

19:49.200 --> 19:52.200
I think I used Jaeger, I guess,

19:52.200 --> 19:55.200
for this proof of concept, yes.

19:55.200 --> 19:59.200
So, I really wanted to do a very,

19:59.200 --> 20:02.200
very quick proof of concept to see if I could actually

20:02.200 --> 20:07.200
profile the MPI functions and send them to Jaeger

20:07.200 --> 20:11.200
to see the profile traces, the spans, for the visualisation,

20:11.200 --> 20:14.200
for example. But here's what we can get;

20:14.200 --> 20:16.200
you see, again, the same IOR test.

20:16.200 --> 20:18.200
How the IOR test works is, basically,

20:18.200 --> 20:20.200
the series of events is:

20:20.200 --> 20:23.200
the IOR test opens a file, writes the data,

20:23.200 --> 20:25.200
a given amount of data, closes the file,

20:25.200 --> 20:28.200
then opens the same file, reads the data that it has written,

20:28.200 --> 20:30.200
and then closes the file.
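
A minimal sketch of turning enter/exit events on those MPI calls into spans that could be shipped to a tracing backend like Jaeger (the timestamps and the non-overlapping-calls assumption are invented for illustration):

```python
# Pair enter/exit events per function into (name, start, duration) spans.
# Assumes calls do not overlap within one process, which matches the
# sequential open-write-close-open-read-close pattern of the IOR test.

def to_spans(events):
    """events: list of (ktime, 'enter'|'exit', func_name), time-ordered."""
    spans, open_calls = [], {}
    for ts, kind, name in events:
        if kind == "enter":
            open_calls[name] = ts
        else:
            start = open_calls.pop(name)
            spans.append((name, start, ts - start))
    return spans

trace = [(0, "enter", "MPI_File_open"), (5, "exit", "MPI_File_open"),
         (6, "enter", "MPI_File_write"), (90, "exit", "MPI_File_write"),
         (91, "enter", "MPI_File_close"), (120, "exit", "MPI_File_close")]
for span in to_spans(trace):
    print(span)  # the close span is the longest here, like the sync below
```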

20:30.200 --> 20:31.200
So, what we're seeing here is

20:31.200 --> 20:34.200
basically the events: the MPI_File_open

20:34.200 --> 20:38.200
and MPI_File_set_view, and then MPI_File_write;

20:38.200 --> 20:41.200
we're seeing the profile spans.

20:41.200 --> 20:44.200
And then, once the file has been written,

20:44.200 --> 20:46.200
MPI_File_close; I think this is where

20:46.200 --> 20:49.200
the sync is happening, that's why it's taking a long time.

20:49.200 --> 20:52.200
And then MPI_File_open for the read,

20:52.200 --> 20:54.200
and then, at the end, MPI_File_read,

20:54.200 --> 20:57.200
and then it closes as well, probably;

20:57.200 --> 21:00.200
I couldn't get it inside the screenshot, but

21:00.200 --> 21:03.200
yeah, with eBPF,

21:03.200 --> 21:06.200
I mean, if you can trace

21:06.200 --> 21:08.200
the MPI functions, and send these traces

21:08.200 --> 21:10.200
to something like Jaeger,

21:10.200 --> 21:14.200
we can get these sorts of spans and profiles

21:14.200 --> 21:16.200
in real time, for the users,

21:16.200 --> 21:18.200
and it's, it's really important for the HPC people,

21:18.200 --> 21:21.200
because, like, HPC clusters consume a lot of energy,

21:21.200 --> 21:23.200
and, like, you know, if you can get this sort of information

21:23.200 --> 21:25.200
in real time, if things are going badly,

21:25.200 --> 21:27.200
you can stop your job instead of, like,

21:27.200 --> 21:28.200
running the job till the end,

21:28.200 --> 21:31.200
and only realising afterwards that things went wrong.

21:31.200 --> 21:33.200
And yeah, this is

21:33.200 --> 21:35.200
something more we can do

21:35.200 --> 21:37.200
you know, only

21:37.200 --> 21:39.200
if the HPC community comes together;

21:39.200 --> 21:41.200
that is, standardization of the monitoring.

21:41.200 --> 21:43.200
But still, it's just as fragmented:

21:43.200 --> 21:45.200
everybody does their own thing.

21:45.200 --> 21:48.200
Just some closing remarks, yeah:

21:48.200 --> 21:50.200
what we did is a zero-instrumentation

21:50.200 --> 21:52.200
I/O monitoring framework.

21:52.200 --> 21:55.200
It's true that I showed it only for HPC here,

21:55.200 --> 21:58.200
but there is no reason that it wouldn't work

21:58.200 --> 22:00.200
for the cloud, or even

22:00.200 --> 22:03.200
Kubernetes, as I said, because I've tested it on both

22:03.200 --> 22:05.200
OpenStack and Kubernetes, and it works

22:05.200 --> 22:07.200
out of the box.

22:07.200 --> 22:09.200
Yeah.

22:09.200 --> 22:12.200
Well, the approach is to leverage,

22:12.200 --> 22:14.200
as much as possible, the cloud-native tooling,

22:14.200 --> 22:16.200
because, like, I'm a very pragmatic person myself,

22:16.200 --> 22:18.200
I don't want to recreate things just

22:18.200 --> 22:19.200
for the sake of creating them, you know.

22:19.200 --> 22:20.200
So, most of the things

22:20.200 --> 22:22.200
we're using are Grafana and Prometheus,

22:22.200 --> 22:25.200
and it's pretty good; as I said,

22:25.200 --> 22:27.200
we've been using this exporter on

22:27.200 --> 22:29.200
2,500 nodes without any problems with

22:29.200 --> 22:32.200
scalability for two and a half years now,

22:32.200 --> 22:34.200
and thanks to the Prometheus community,

22:34.200 --> 22:36.200
because they did a great job

22:36.200 --> 22:39.200
with Prometheus, and it works very well.

22:39.200 --> 22:42.200
And yeah, eBPF also gives us the ability to trace

22:42.200 --> 22:45.200
any sort of function, any, any, any levels,

22:45.200 --> 22:49.200
so, that, that can, that can be, very, very useful,

22:49.200 --> 22:52.200
like, you know, in something in, in, in, in,

22:52.200 --> 22:53.200
in, in, in, in this is, in, in, in, in, in,

22:53.200 --> 22:55.200
in a, in a, in a domain like HPC,

22:55.200 --> 22:57.200
and yeah, we can combine with,

22:57.200 --> 22:58.200
we've, this continuous profiling,

22:58.200 --> 23:01.200
to give, like, a complex monitoring and profiling solution

23:01.200 --> 23:07.480
for HPC platforms, and not only those, but also cloud and

23:07.480 --> 23:09.560
Kubernetes platforms.

23:09.560 --> 23:13.600
23:09.560 --> 23:13.600
And if you want to get the HPC eBPF code

23:13.600 --> 23:15.440
and stuff like this, everything is on GitHub,

23:15.440 --> 23:18.600
like, you know, you can go have a look if you're

23:18.600 --> 23:19.200
interested.

23:19.200 --> 23:20.600
That's it, thank you.

23:20.600 --> 23:42.600
So thank you for your presentation.

23:42.600 --> 23:48.280
23:42.600 --> 23:48.280
I have one question. Because you are profiling at the VFS

23:48.280 --> 23:49.280
level,

23:49.280 --> 23:54.680
the write or the read, did you make some tests regarding the

23:54.680 --> 24:02.280
metadata, like stat calls, and just the open? And did you

24:02.280 --> 24:07.880
try to figure out what you could get?

24:07.880 --> 24:11.280
So I see you don't have the stats here.

24:11.280 --> 24:12.280
24:11.280 --> 24:12.280
So the stat, no, no.

24:12.280 --> 24:16.280
Yeah, the I/O, not the stat, but yeah, I think that can be done

24:16.280 --> 24:17.280
as well.

24:17.280 --> 24:23.680
24:17.280 --> 24:23.680
Because on some workloads, you can have this typical stuff running

24:23.680 --> 24:29.480
with codes that can be really bad for the file system,

24:29.480 --> 24:31.480
as they're parallel of some sort.

24:31.480 --> 24:33.480
Yes, yes, yes.

24:33.480 --> 24:36.480
It can be done, thank you.
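The talk doesn't show code for extending the VFS profiling to metadata operations; as a hedged illustration only, here is a minimal userspace Python sketch of the aggregation side. In a real tool the events would come from BPF probes attached to functions like vfs_open or vfs_getattr; the event tuples and sample paths below are entirely hypothetical.

```python
from collections import Counter

def aggregate_metadata_ops(events):
    """Count metadata operations (open/stat/...) per mount point.

    `events` is a list of (mount_point, op) tuples, standing in for
    records that BPF probes on metadata paths would emit.
    """
    counts = Counter()
    for mount_point, op in events:
        counts[(mount_point, op)] += 1
    return counts

# Hypothetical sample: parallel jobs hammering /scratch with stat calls.
events = [
    ("/scratch", "open"), ("/scratch", "stat"),
    ("/scratch", "stat"), ("/home", "open"),
]
print(aggregate_metadata_ops(events)[("/scratch", "stat")])  # 2
```

The same per-mount-point keying the talk uses for read/write throughput would keep the map cardinality bounded here as well.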

24:47.480 --> 24:49.480
Hello, thank you.

24:49.480 --> 24:56.480
24:49.480 --> 24:56.480
You just monitored the throughput, but could you also monitor the

24:56.480 --> 25:03.480
time a single I/O operation was taking, just to see if I have

25:03.480 --> 25:08.480
some blocking I/O somewhere?

25:08.480 --> 25:10.480
You mean like timestamps?

25:10.480 --> 25:15.480
Yeah, just with timestamps. I want to dump my buffer onto the file

25:15.480 --> 25:20.480
system and it takes normally 100 milliseconds, but once in a while

25:20.480 --> 25:27.480
it takes a full second, and I want to know why.

25:27.480 --> 25:31.480
25:27.480 --> 25:31.480
Well, I mean, as I said, for this sort of profiling,

25:31.480 --> 25:38.480
you should profile the syscall, the syscall.

25:38.480 --> 25:39.480
Yeah, okay.

25:39.480 --> 25:43.480
You can get that by profiling the function.

25:43.480 --> 25:45.480
Yes, that's true.

25:45.480 --> 25:47.480
You should be able to get that information.
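The answer points at the standard entry/exit timestamping pattern (a kprobe stores a timestamp keyed by thread id, a kretprobe computes the delta). As an illustration only, here is that pairing logic in plain Python, with hypothetical event tuples standing in for probe firings; it is not BPF code.

```python
def latency_outliers(events, threshold_ns):
    """Pair entry/exit timestamps per thread and flag slow operations.

    Mimics the kprobe/kretprobe pattern: on entry, store the timestamp
    keyed by thread id; on exit, compute the delta and report outliers.
    """
    start = {}        # tid -> entry timestamp (the role of a BPF map)
    outliers = []
    for tid, kind, ts in events:   # kind is "enter" or "exit"
        if kind == "enter":
            start[tid] = ts
        elif tid in start:
            delta = ts - start.pop(tid)
            if delta > threshold_ns:
                outliers.append((tid, delta))
    return outliers

# One write takes 100 ms, another a full second (times in ns).
events = [(1, "enter", 0), (1, "exit", 100_000_000),
          (2, "enter", 0), (2, "exit", 1_000_000_000)]
print(latency_outliers(events, 500_000_000))  # [(2, 1000000000)]
```

This is exactly the shape of the questioner's case: the usual 100 ms dump passes the threshold check, the occasional one-second dump gets reported.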

25:47.480 --> 25:48.480
Okay, thanks.

25:48.480 --> 25:55.480
25:48.480 --> 25:55.480
We're getting to the end.

25:55.480 --> 25:57.480
Not finished yet.

25:57.480 --> 26:02.480
More questions?

26:02.480 --> 26:04.480
Thank you for your talk.

26:04.480 --> 26:10.480
Can you track the I/O on a per-file level?

26:10.480 --> 26:12.480
26:10.480 --> 26:12.480
Yes, technically, yes.

26:12.480 --> 26:15.480
It's just that your map is going to be super huge.

26:15.480 --> 26:21.480
26:15.480 --> 26:21.480
I mean, that's why, like, here I used the mount point,

26:21.480 --> 26:25.480
like, to aggregate all the I/O happening on the mount point,

26:25.480 --> 26:30.480
but you can use, like, a file path, for example.

26:30.480 --> 26:35.480
If you do it in the trace, I guess there you could include the exact file.

26:35.480 --> 26:36.480
Yeah, exactly.

26:36.480 --> 26:39.480
It's possible.
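The trade-off discussed here (per-file keys give detail but blow up the map, per-mount-point keys stay bounded) can be sketched in a few lines of userspace Python. This is only an illustration of the keying choice; the sample paths are hypothetical, and the mount point is simplified to the first path component rather than resolved properly.

```python
from collections import defaultdict

def aggregate_bytes(events, per_file=False):
    """Sum I/O bytes either per mount point or per individual file.

    Per-file keys keep full detail, but the map cardinality grows with
    the number of files touched; mount-point keys stay small and bounded.
    """
    totals = defaultdict(int)
    for path, nbytes in events:
        key = path if per_file else "/" + path.split("/")[1]
        totals[key] += nbytes
    return dict(totals)

events = [("/scratch/a.dat", 4096), ("/scratch/b.dat", 8192),
          ("/home/user/x", 512)]
print(aggregate_bytes(events))             # {'/scratch': 12288, '/home': 512}
print(len(aggregate_bytes(events, True)))  # 3 (one key per file)
```

In a BPF map the same concern applies directly: a key per mount point has a small, fixed bound, while a key per file path can grow without limit on a busy parallel file system.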

26:39.480 --> 26:44.480
26:39.480 --> 26:44.480
We still have a couple of moments if anyone wants to ask some more questions.

26:45.480 --> 26:49.480
Looks like we're good.

26:49.480 --> 26:50.480
All right, thank you.

26:50.480 --> 26:51.480
Thank you.

27:14.480 --> 27:18.480
Do you both want some coffee or something?

27:44.480 --> 27:54.480
Okay.

27:54.480 --> 27:59.480
Okay.

27:59.480 --> 28:04.480
Thank you.

28:04.480 --> 28:16.480
Thank you.

28:16.480 --> 28:19.480
So we're just waiting for a few minutes more.

28:19.480 --> 28:21.480
28:19.480 --> 28:21.480
As you remember, we still have a lot of stickers,

28:21.480 --> 28:24.480
eBPF stickers, at the front if you want

28:24.480 --> 28:27.480
to refill your stock of stickers.

28:27.480 --> 28:33.480
Not that stickers are usually lacking at FOSDEM, but never mind.

28:33.480 --> 28:36.480
28:33.480 --> 28:36.480
That's why we have to clear them out again.

28:36.480 --> 28:37.480
That's right.

28:37.480 --> 28:40.480
So come get some stickers.

29:04.480 --> 29:08.480
Yeah.

29:08.480 --> 29:15.480
So who has experience with manipulating strings in BPF programs?

29:15.480 --> 29:21.480
Yeah, a couple of people.

29:21.480 --> 29:30.480
Hopefully it will be easier after the presentation.

29:30.480 --> 29:31.480
All right.

29:31.480 --> 29:37.480
29:31.480 --> 29:37.480
So Victor's going to talk about simplifying string handling in BPF programs.

29:37.480 --> 29:38.480
So let's start.

29:38.480 --> 29:42.480
Take it away.

29:42.480 --> 29:43.480
Thanks.

29:43.480 --> 29:45.480
So can you carry this?

29:45.480 --> 29:46.480
Cool.

29:46.480 --> 29:47.480
So hi.

29:47.480 --> 29:48.480
I'm Victor.

29:48.480 --> 29:50.480
I work for Red Hat.

29:50.480 --> 29:55.480
29:50.480 --> 29:55.480
And yeah, as was just said, I'd like to talk about handling strings

29:55.480 --> 30:03.480
in eBPF programs, something that has been kind of clumsy over the years.

30:03.480 --> 30:09.480
So I'll talk about the string kfuncs we've introduced in the past year.

