WEBVTT

00:00.000 --> 00:11.000
All right, thank you. Up for the second presentation of the day: again, some

00:11.000 --> 00:18.000
firsthand experience, this time working with eBPF to detect RDMA device driver bugs in real time,

00:18.000 --> 00:22.000
from Procure and Max.

00:22.000 --> 00:24.000
So, just from Max.

00:24.000 --> 00:37.000
Thank you, you go ahead.

00:37.000 --> 00:42.000
And if everyone could kind of just scooch to the middle — we are not allowed to have people standing

00:42.000 --> 00:46.000
up because of fire safety, so you have to find a seat.

00:46.000 --> 00:53.000
So please kind of scooch to the middle, so everyone can find a seat.

00:53.000 --> 01:18.000
And now, okay, so this works.

01:18.000 --> 01:20.000
So hello, everyone.

01:20.000 --> 01:22.000
My name is Max.

01:22.000 --> 01:25.000
I'm a production engineering TLM at Meta.

01:25.000 --> 01:31.000
And today I would like to talk a little bit about how we use eBPF to accelerate our troubleshooting

01:31.000 --> 01:37.000
capabilities — basically, how we run GPU workloads at scale and how eBPF helps us to do

01:37.000 --> 01:41.000
it reliably and sustainably, I should say.

01:41.000 --> 01:47.000
Also, it's the first time I present at such a huge venue, so I apologize in advance if

01:47.000 --> 01:49.000
there are any hiccups or any issues.

01:49.000 --> 01:55.000
So please take that with a grain of salt — but anyway, thank you so much.

01:55.000 --> 02:01.000
And just before we begin, a couple of words about the work's authors.

02:01.000 --> 02:03.000
So, it's Procure and me.

02:03.000 --> 02:05.000
Procure is a software engineer at Meta.

02:05.000 --> 02:10.000
He works in the host networking space.

02:10.000 --> 02:15.000
So, for example, he actually worked on the implementation for this specific

02:15.000 --> 02:20.000
program, and there was a professor at CMU who actually inspired us to come up with this presentation.

02:20.000 --> 02:23.000
And basically a couple of words about myself.

02:23.000 --> 02:29.000
So I'm a production engineer, and I work in the GPU comms space: distributed comms, collective

02:29.000 --> 02:30.000
comms.

02:30.000 --> 02:35.000
And basically, I actually came up with the initial problem, which we tried to solve using eBPF here.

02:35.000 --> 02:42.000
So, basically, this talk is about how we use eBPF for RDMA.

02:42.000 --> 02:47.000
And it's necessary to talk a little bit about the RDMA subsystem and its internals,

02:47.000 --> 02:55.000
because it's a little bit different from any other subsystem we deal with when we try to do eBPF tracing, for example.

02:55.000 --> 03:02.000
So let me start with a quick overview of what the GPU workloads look like at Meta — and

03:02.000 --> 03:05.000
basically everywhere else, actually.

03:05.000 --> 03:07.000
So we have a lot of GPU machines.

03:07.000 --> 03:10.000
They have some number of GPUs, typically eight.

03:10.000 --> 03:14.000
And they are actually connected with three types of networks:

03:14.000 --> 03:18.000
the so-called scale-up, scale-out, and the so-called frontend networks.

03:18.000 --> 03:23.000
So the scale-up network is some proprietary GPU-to-GPU implementation —

03:23.000 --> 03:28.000
for example, NVLink, or some alternatives from Google and other vendors.

03:28.000 --> 03:30.000
There are plenty of them, actually.

03:30.000 --> 03:34.000
There is also the frontend network, which is regular Ethernet links

03:34.000 --> 03:39.000
we use to control the servers, load data from the internet, and so on.

03:39.000 --> 03:45.000
And then we come to the scale-out network, which allows us to really run GPU workloads at scale

03:45.000 --> 03:51.000
and connect thousands of GPUs inside the same collective operations at the same time.

03:51.000 --> 03:55.000
And here is the place we actually use RDMA.

03:55.000 --> 04:01.000
And obviously — probably it's just a luxury for Meta that we have two separate, frontend and

04:01.000 --> 04:06.000
backend, networks — so we have quite a lot of network interfaces and NICs in the server.

04:06.000 --> 04:12.000
But anyway, as you can imagine, these backend interfaces are actually paired with GPUs,

04:12.000 --> 04:17.000
and they have really, really high bandwidth, really high connection speed.

04:17.000 --> 04:22.000
And in this situation, the regular networking stack just cannot keep up with the capabilities of such a device.

04:22.000 --> 04:26.000
So, in this sense, this is why we need to use RDMA here.

04:26.000 --> 04:33.000
Another important point is that RDMA unlocks the ability for so-called parallelism,

04:33.000 --> 04:41.000
when you can actually run data flow between GPUs and NICs completely independently from the CPU.

04:41.000 --> 04:50.000
So, in this sense, it actually creates a unique opportunity to achieve lower latency and higher bandwidth.

04:50.000 --> 04:55.000
So basically, this is the explanation of why we need to use RDMA.

04:55.000 --> 05:02.000
And let me quickly mention the two main features of RDMA: it obviously avoids the networking stack,

05:02.000 --> 05:09.000
and it also allows direct connections between GPUs and the network interface.

05:09.000 --> 05:14.000
Now, okay, we have RDMA — so how can we actually use it in our workloads?

05:14.000 --> 05:21.000
Let me also give a quick introduction to how RDMA actually works together with GPU workloads.

05:21.000 --> 05:28.000
So first of all, for RDMA — and this is something which is more or less unique to the RDMA stack —

05:28.000 --> 05:31.000
you need to do some sort of server coordination.

05:31.000 --> 05:38.000
Obviously, when you have some GPU workload, you have multiple hosts and also multiple GPUs

05:38.000 --> 05:41.000
participating in the whole workload.

05:41.000 --> 05:50.000
And ideally, by defining some GPU workload, you already understand which communication channels you need to set up between different nodes.

05:51.000 --> 05:54.000
So, in this sense, you know the topology.

05:54.000 --> 05:58.000
But the problem is that you also need to do some sort of discovery.

05:58.000 --> 06:05.000
You need to basically exchange some sort of tokens, which will allow you to initiate or make a connection.

06:05.000 --> 06:12.000
And this is something which basically creates a huge difference between RDMA and the regular TCP networking stack,

06:12.000 --> 06:22.000
where you can actually just connect to a specific IP address, get an automatically allocated outgoing port, and everything will actually work.

06:22.000 --> 06:30.000
In RDMA, you do need to know the exact QP number — a QP is essentially the RDMA socket in this sense.

06:30.000 --> 06:39.000
You need to know the exact QP numbers of the source and the remote destination, and actually both sides should know this information beforehand.

06:39.000 --> 06:43.000
So this is why we have a frontend network, and this is why we need the server coordination:

06:43.000 --> 06:54.000
to basically exchange all the required information for every node to be able to establish RDMA connections across the whole GPU workload.
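The coordination step described here can be sketched in a few lines. This is a hypothetical illustration, not Meta's code: the field names (`qp_num`, `gid`, `psn`) follow the usual ibverbs connection parameters, and the "exchange" simply models the out-of-band handshake over the frontend network.

```python
from dataclasses import dataclass

# Hypothetical sketch of the out-of-band "token" exchange: unlike TCP, both
# RDMA endpoints must learn each other's queue pair (QP) number and
# addressing info before any RDMA traffic can flow.

@dataclass
class QpInfo:
    qp_num: int   # queue pair number, assigned locally by the NIC/driver
    gid: str      # global identifier (address) of the local port
    psn: int      # initial packet sequence number

def exchange(local: QpInfo, remote: QpInfo) -> tuple[QpInfo, QpInfo]:
    """Simulate the frontend-network exchange: each side hands its
    connection parameters to the peer before any RDMA traffic flows."""
    return remote, local  # each side now holds the peer's parameters

a = QpInfo(qp_num=101, gid="fe80::1", psn=1000)
b = QpInfo(qp_num=202, gid="fe80::2", psn=2000)
peer_of_a, peer_of_b = exchange(a, b)
print(peer_of_a.qp_num, peer_of_b.qp_num)  # 202 101
```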

06:54.000 --> 07:03.000
This was the first step, and the second step is basically initiating an RDMA connection, which involves a lot of heavy lifting on the kernel side.

07:03.000 --> 07:13.000
Because you need to create a lot of objects, and you need to transition this QP between different states, basically, to connect to the different remote nodes.

07:13.000 --> 07:26.000
And also you need to basically upload some of the commands, some of the information, into the NIC in order for the RDMA connection to be established and to start functioning.
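The QP state transitions mentioned here follow the standard ibverbs sequence RESET → INIT → RTR → RTS. Below is a toy state machine, not driver code, just to make the "transition between different states" step concrete:

```python
# Illustrative sketch of the QP state transitions the kernel performs during
# connection setup: RESET -> INIT -> RTR -> RTS.
ALLOWED = {
    ("RESET", "INIT"),
    ("INIT", "RTR"),   # ready-to-receive: remote QP number/PSN get programmed
    ("RTR", "RTS"),    # ready-to-send: the QP can now initiate transfers
}

class QueuePair:
    def __init__(self):
        self.state = "RESET"

    def modify(self, new_state: str):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"invalid transition {self.state} -> {new_state}")
        self.state = new_state

qp = QueuePair()
for s in ("INIT", "RTR", "RTS"):
    qp.modify(s)
print(qp.state)  # RTS
```

Skipping a step (say, RESET straight to RTS) is rejected, which mirrors why a partially initialized connection fails with an error rather than half-working.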

07:26.000 --> 07:37.000
And the very last step of this set of preparations is actually mapping the network interface config space into user space —

07:37.000 --> 07:44.000
basically creating this bypass mechanism, which brings the whole brilliance of the RDMA stack.

07:44.000 --> 07:55.000
This happens at the last step: when everything is connected, when everything is initialized, when we have also mapped memory regions

07:55.000 --> 08:07.000
to specific network interfaces and specific GPUs, we basically say: okay, now everything is ready, let's map the control path into user space and allow the application —

08:07.000 --> 08:13.000
the RDMA application, the collective library, everything — to just work directly with the network interface.

08:13.000 --> 08:21.000
So this is the role of the Linux kernel inside this specific process.

08:21.000 --> 08:34.000
And basically, the most interesting part is that most of the job failures, most of the reliability issues, actually happen during step two, during initialization.

08:34.000 --> 08:50.000
Most of the time in practice, if you were able to establish all the connections, and if you were able to at least once send some data through them, they will probably stay reliable up until full job completion — unless you run into some real hardware issues,

08:50.000 --> 08:55.000
hardware or network issues, which are obviously out of scope of this talk.

08:55.000 --> 09:09.000
But basically, there are quite a lot of situations when initialization may fail, and around 10 to 20% of AI job crashes are actually related to these system call errors

09:09.000 --> 09:18.000
emitted by GPU jobs. This is our Meta example, but I think this should be a general thing at scale for any other company.

09:18.000 --> 09:36.000
And specifically for us, over the last six months, we actually observed around 32K+ GPU hours related to system call errors reported by our failed GPU jobs, and 5K+ of them were actually attributed to the RDMA subsystem.

09:36.000 --> 09:49.000
So this specific issue is pretty significant. Plus, sometimes you can also run into a very nasty situation, which hurts — I should say — flagship model workloads

09:49.000 --> 10:00.000
quite a lot: a crash loop. Basically, if your system, your setup, your configuration actually runs into some steady erratic state and your job cannot actually start,

10:00.000 --> 10:11.000
it just keeps wasting your resources in attempting to start the workload — and this is the biggest issue, and it attracts a lot of attention from everybody around.

10:11.000 --> 10:24.000
So this is why system call errors — and RDMA system call errors in particular — are really important in real production, if you really try to sustain training on thousands of GPUs.

10:24.000 --> 10:41.000
All right. So I hope that I was able to demonstrate the importance of this specific topic. So let's talk a little bit about what types of errors we can actually see during training and how they are related to RDMA.

10:41.000 --> 10:51.000
So this is a typical training stack we may observe, for example, at Meta: we have PyTorch and the model code as the application layer, fully in user space.

10:51.000 --> 11:12.000
It actually uses collective libraries — and this is the basic area I work in — collective libraries such as NCCL and Meta's collective library, which do all the heavy lifting for ML or AI engineers in terms of organizing, establishing, and sustaining GPU-to-GPU connectivity.

11:12.000 --> 11:25.000
In order to do that, of course, we need to interact with RDMA as well, and here is where the rdma-core set of libraries, such as libibverbs, comes into play.

11:25.000 --> 11:36.000
Basically, it handles all the business — still on the user-space layer — related to RDMA, to network interfaces, and their connection with GPUs.

11:36.000 --> 11:56.000
In reality, for example, when we map the NIC config space into user space, it's exactly rdma-core which contains the user-space driver to interact with the network interface, and it actually provides the standardized verbs interface up to, for example, the NCCL library.

11:56.000 --> 12:12.000
But again, in some situations we still need to interact with the kernel, and rdma-core obviously interacts with the kernel RDMA subsystem, which in turn consists of a vendor-specific and a vendor-independent part.

12:12.000 --> 12:24.000
And basically, the vendor-specific part of the RDMA subsystem also interacts with the NIC firmware or with the NIC hardware.

12:25.000 --> 12:34.000
And as you can imagine, there could be issues at any layer of this specific stack, and there is plenty of work to troubleshoot it.

12:34.000 --> 12:42.000
But let's focus on the system call errors — so, something between rdma-core and the kernel.

12:43.000 --> 13:06.000
So, as I have mentioned, they impact quite a significant fraction of our jobs. And this work actually started with investigating a single issue, when we started observing some pretty important GPU workload entering a crash-looping pattern and emitting a specific system call error.

13:06.000 --> 13:24.000
The obvious idea which came to my mind was basically to use ftrace in the function graph mode, for us to be able to understand what's happening inside the kernel — just to understand what specific part of the kernel actually throws an error and what's going on.

13:24.000 --> 13:33.000
But it was actually a pretty tedious process to go through the manual analysis of the whole trace.

13:33.000 --> 13:39.000
And also, at some point, we realized that we actually have a lot of such system call errors happening in production.

13:39.000 --> 13:52.000
So we understood that, hey, this manual approach is unscalable, and we need to figure out something which will help us automate this process of identifying and investigating RDMA system call errors.

13:53.000 --> 14:05.000
And basically, the obvious solution is eBPF. So I probably won't mention all the benefits of using eBPF for this specific kind of stuff, because it's pretty obvious to this audience.

14:05.000 --> 14:26.000
But I just want to mention that, again, as I said before, the Linux kernel still plays a significant role in GPU workload initialization — and again, most of the issues, if there are issues happening with the GPU workload, are actually happening during the initialization phase, and they happen inside the kernel.

14:26.000 --> 14:33.000
So this is why eBPF is a perfect solution for us to look and understand what's going on there.

14:34.000 --> 14:58.000
And before I jump to the details of our solution, I also would like to refer to the previous presentation, because in RDMA, of course, there are some situations when we may actually see failures during the data flow phase — the third stage, when we actively run communication from user space.

14:58.000 --> 15:15.000
And this is actually the story about inlining, uprobes, and all this stuff — because, for example, some of the network issues are actually reported only to user space, and this is one of the weird features of the RDMA stack:

15:15.000 --> 15:40.000
if your QP transitions into an error state, this information may never reach the kernel, so basically only user space will know about it. And if you want to get observability for such specific issues, you need to mess with uprobes — and inlining is one of the specific pitfalls you need to manage when tracing with uprobes and tracing the RDMA stack as well.

15:40.000 --> 16:01.000
All right, so let's get back to the topic of using eBPF for RDMA system call error tracing. Basically, when you decide to use eBPF, you're actually facing three classes of big challenges, and let me go through them one after another.

16:01.000 --> 16:16.000
The first one: we need to actually select the specific trace points, the specific functions we want to watch inside the RDMA stack, and there are some caveats, some interesting gotchas, there as well.

16:17.000 --> 16:28.000
The RDMA subsystem — the ibverbs subsystem — has only a single entry point, which is the single ioctl related to ibverbs, and this is a problem, because if you start tracing it,

16:29.000 --> 16:38.000
you won't get enough information. You won't understand what's going on; you will just get some obscure error message. And, as you probably all know,

16:38.000 --> 16:56.000
sometimes a system call may return a single error code for multiple different situations, for multiple different problems. So this is clearly an issue if you have a single entry point to the whole tree of different functionalities handled by it.

16:57.000 --> 17:08.000
So, in order to be able to pick proper trace points, we applied static analysis: we just tried to look at the whole call graph produced under this entry point.

17:09.000 --> 17:31.000
And we also applied a couple of heuristics. First of all, we avoided read-only calls, because if you just query some information, a failure there probably won't do the workload any harm, and we are actually not interested. Plus, there could also be some read-only system calls which, when you query for some information, just return ENOENT — which is completely fine for us.

17:32.000 --> 17:43.000
We also tried to ignore subtrees related to pure kernel common-code functions, because, after all, if you have some common memory-management function failing,

17:43.000 --> 17:52.000
it will probably be visible not only inside RDMA but everywhere else. So this is why we decided to ignore those specific subtrees.

17:52.000 --> 18:06.000
And we also tried to avoid cleanup and reference-counting functions, just because during initialization you don't usually do cleanup, so in this situation you're not looking into it.

18:07.000 --> 18:14.000
So this is an example of the list of functions we actually used for our tracing solution.
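The three heuristics above can be sketched as a simple call-graph walk. Everything here is hypothetical — the function names and the tiny graph are made up for illustration; only the filtering logic mirrors what the talk describes (skip read-only query paths, skip common kernel code, skip cleanup/refcount functions):

```python
# Toy version of the call-graph filtering: start at the single ibverbs entry
# point and collect candidate trace points, pruning uninteresting subtrees.
CALL_GRAPH = {
    "ib_uverbs_ioctl": ["create_qp", "query_device", "reg_mr", "kmalloc"],
    "create_qp": ["modify_qp", "qp_put"],
    "reg_mr": ["pin_user_pages"],
}

def is_read_only(fn):   return fn.startswith("query_")
def is_common_code(fn): return fn in {"kmalloc", "pin_user_pages"}
def is_cleanup(fn):     return fn.endswith(("_put", "_destroy", "_free"))

def pick_tracepoints(graph):
    seen, picked = set(), []
    stack = ["ib_uverbs_ioctl"]
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        if is_read_only(fn) or is_common_code(fn):
            continue  # prune: do not descend into skipped subtrees
        if not is_cleanup(fn):
            picked.append(fn)
        stack.extend(graph.get(fn, []))
    return sorted(picked)

print(pick_tracepoints(CALL_GRAPH))
# ['create_qp', 'ib_uverbs_ioctl', 'modify_qp', 'reg_mr']
```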

18:15.000 --> 18:23.000
Okay, basically we have established the set of trace points; let's also think about how we will implement that on the technical side.

18:23.000 --> 18:32.000
So we also had this problem of deciding between kprobes, kretprobes, and fentry/fexit.

18:32.000 --> 18:46.000
In the end, we actually picked kfuncs — we picked fentry/fexit — just because, first of all, we did not need the huge flexibility provided by kprobes. After all, we only needed

18:46.000 --> 19:00.000
return values, and we only needed specific functions — hey, just get the signal whether you have entered a specific function or left a specific function. So we did not need all this advanced functionality provided to us by kprobes.

19:00.000 --> 19:19.000
Also, speaking about the stability of this kind of interface, I should say that kfuncs are actually a little bit more stable than kprobes — in the sense that, if you're lucky, function names are not changing too frequently. So in terms of

19:19.000 --> 19:30.000
long-term support, using kfuncs is actually a more pleasant thing to manage.

19:30.000 --> 19:45.000
And for us, the main goal was the lowest possible overhead, because we wanted a scalable solution to monitor every GPU host in the fleet — and obviously we may see different workloads with different corner cases, so we wanted to

19:45.000 --> 20:00.000
protect ourselves from any interference with the GPU workload. Because, after all, if you bring some overhead, you're actually wasting GPU hours, and nowadays GPU hours are pretty expensive for every company, for every customer.

20:00.000 --> 20:09.000
Of course, when we did our implementation, we ran a comparison between kfuncs and kprobes

20:09.000 --> 20:22.000
applied to our workloads, and in this sense we saw that fentry/fexit roughly takes 7 to 20% of a kprobe's overhead. So it was a

20:22.000 --> 20:34.000
step ahead for us, and it provided us some headroom in terms of end-to-end overhead. We actually used a benchmarking tool just for pure comparison, and we also ran NCCL benchmark tests —

20:34.000 --> 20:45.000
basically, we ran some synthetic workloads to understand how it would actually change the initialization time of a synthetic GPU workload.

20:46.000 --> 20:53.000
So we basically ended up with kfuncs, and here is the whole end-to-end solution design. We have multiple

20:53.000 --> 21:05.000
fentry/fexit probes placed on specific spots inside the kernel, inside the RDMA subsystem. All of them collect unified events, which we put inside a

21:06.000 --> 21:15.000
BPF ring buffer — a BPF map with around a 256-kilobyte buffer, for around 8K records.

21:15.000 --> 21:23.000
Then we basically have a user-space application which keeps polling this buffer and unloading this information. And actually,

21:23.000 --> 21:33.000
I should say, busy polling with some sleep intervals was enough for us to not lose any information and to get all the events we actually expected.
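The consumer side of this design can be sketched as follows. This is a simulation, not the real BPF plumbing: a bounded deque stands in for the BPF ring buffer, and the loop models the "busy polling with sleep intervals" approach; event names and errno values are illustrative.

```python
import time
from collections import deque

# A bounded deque stands in for the BPF ring buffer the probes write into.
RING_CAPACITY = 8192          # ~8K records, as in the talk
ring = deque(maxlen=RING_CAPACITY)

# Pretend the kernel side produced some events: (function, return code).
for ev in [("create_qp", -12), ("reg_mr", -12), ("modify_qp", -22)]:
    ring.append(ev)

# User-space consumer: drain the buffer, sleeping briefly between empty polls.
collected = []
idle_polls = 0
while idle_polls < 3:          # stop after a few empty polls (demo only)
    if ring:
        collected.append(ring.popleft())
    else:
        idle_polls += 1
        time.sleep(0.01)       # sleep interval between polls
print(collected)
```

In a real deployment the loop would run forever and hand each drained event to the export pipeline; the sleep interval trades a little latency for near-zero idle CPU cost.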

21:33.000 --> 21:44.000
And basically, we export all the data into our observability DB, which is Scuba. Although this specific system is not open source, it's actually publicly documented,

21:44.000 --> 21:52.000
so, you know, for your own solution at home you can use any observability database you want.

21:52.000 --> 22:08.000
Also, in addition to just emitting events, we implemented some counters, which we exported to our time-series database, for us to use for alerting and for the whole accounting of specific events or specific errnos —

22:08.000 --> 22:12.000
just to understand what's going on inside the system.
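The counter export can be sketched like this. Again a hedged illustration, not Meta's pipeline: the metric-key format and the function names are made up; the point is just aggregating events per (kernel function, errno) so a time-series database can alert on rates.

```python
import errno
from collections import Counter

# Aggregate per-(function, errno) counters with symbolic errno names,
# ready to be pushed to a time-series DB. Events are illustrative.
events = [("reg_mr", -errno.ENOMEM), ("reg_mr", -errno.ENOMEM),
          ("modify_qp", -errno.EINVAL)]

counters = Counter(
    f"rdma.{fn}.{errno.errorcode[-ret]}" for fn, ret in events
)
print(counters["rdma.reg_mr.ENOMEM"])  # 2
```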

22:13.000 --> 22:20.000
And below you can see examples of some of the events detected by our solution.

22:20.000 --> 22:31.000
Also, during deployment, we did another round of research in order to ensure that we actually do not introduce any

22:31.000 --> 22:37.000
regressions or any specific issues. As you can see — this is a graph of CPU utilization on the host —

22:37.000 --> 22:43.000
essentially nothing changed in this sense. Plus, we actually implemented this

22:44.000 --> 22:51.000
RDMA-tracer solution inside netedit — because, I think, it's also a known solution;

22:51.000 --> 23:05.000
it's also public information, available on the internet, about this netedit system. So we have a huge — I should say, widely distributed — binary which manages a multitude of BPF programs running on the host,

23:05.000 --> 23:13.000
for example, those attributed to the networking stack. And our BPF programs related to the RDMA tracer actually took —

23:13.000 --> 23:20.000
well, it was probably the most lightweight BPF in our fleet.

23:20.000 --> 23:32.000
And let me also quickly touch upon the third main challenge: okay, we have this monitoring system, so we get a signal that something is happening at scale

23:32.000 --> 23:41.000
in the fleet, but unfortunately the problem is that you still need to process this signal. You still need to do some manual tracing; you still need to use

23:41.000 --> 23:53.000
ftrace, or you need to use retsnoop, to get the whole call stack, just to understand, okay, which specific place in the kernel returns this specific error code.

23:53.000 --> 24:10.000
And basically there are, again, some heuristics which help us to manage this specific situation. You also need to remember that the kernel changes, everything changes, so you probably also need to extend the set of your trace points as your workload evolves as well.

24:10.000 --> 24:20.000
So, ideally, you need to run a periodic audit, and you need to understand what changes inside the code — inside the errors and their code paths inside the kernel.

24:20.000 --> 24:34.000
But unfortunately, we have not been able to automate it yet. We just know that this is one of the directions we need to move in, in order to create an evolving system for monitoring the errors.

24:34.000 --> 24:54.000
So, yeah, just a couple of words about the RDMA tracer in production today. It actually significantly reduces diagnosis time — from tens of minutes to seconds, I'd say. It allows us to proactively detect emerging RDMA system call issues on our fleet.

24:54.000 --> 25:06.000
And it also helps us to engage fewer people in investigations of specific situations. So, yeah, this is a success for us.

25:06.000 --> 25:18.000
And basically, utilizing the last couple of minutes of my report, I can also mention our case study. This specific work actually started when one of the system calls —

25:18.000 --> 25:27.000
ibv_reg_mr, register memory region — failed with ENOMEM.

25:27.000 --> 25:31.000
The error code, as you see — sorry, this is from a different example.

25:31.000 --> 25:39.000
So it actually resulted in ENOMEM, and initially we thought: okay, we're actually lacking some memory, the host is under memory pressure.

25:39.000 --> 25:51.000
But when we started investigating, we actually discovered that this happens during initialization — so in this situation memory usage is pretty low, and also

25:51.000 --> 26:06.000
all the numbers actually looked fine: we were okay on GPU BAR usage, on memory usage. So nothing should actually trigger this ENOMEM situation. And when we started to investigate that, we actually

26:07.000 --> 26:14.000
discovered — because we use some proprietary code there — that there was a

26:15.000 --> 26:28.000
variable conversion issue between 64-bit and 32-bit integers. And after we implemented this solution, we were able to discover another issue a little bit later, which actually

26:28.000 --> 26:41.000
flared up with the same symptoms but was a completely different situation, related to some issues inside the vendor GPU driver and the huge-pages config we used in production.
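This class of bug can be reproduced in miniature. The sketch below is a hypothetical reconstruction — the real proprietary code is not public — showing how a 64-bit memory-region length squeezed through a 32-bit variable can surface as ENOMEM even though the host has plenty of memory:

```python
import ctypes

# Hypothetical illustration of the 64->32 bit conversion bug class: a large
# registration length silently truncates, so the "out of memory" error has
# nothing to do with actual memory pressure.
def register_mr(length_64: int) -> int:
    length_32 = ctypes.c_uint32(length_64).value  # silent truncation
    if length_32 == 0 or length_32 != length_64:
        return -12  # -ENOMEM: the observed symptom, not real memory pressure
    return 0

print(register_mr(4 * 1024**3))  # 4 GiB truncates to 0 -> -12
print(register_mr(1024**2))      # 1 MiB fits in 32 bits -> 0
```

A 4 GiB region is exactly 2^32 bytes, so it truncates to zero; anything that fits in 32 bits passes, which is why the bug only bit large registrations.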

26:41.000 --> 26:44.000
So yeah just a quick summary.

26:44.000 --> 27:00.000
Again, RDMA system call errors have a significant impact at scale, and this is why we actually implemented the RDMA tracer, which is scalable because of the design choices — and because of just using eBPF, I should say.

27:00.000 --> 27:09.000
And the RDMA tracer actually nicely streamlines our diagnostics and detection process, and it allows us to automatically classify issues as well.

27:09.000 --> 27:11.000
So, yeah, thank you so much.

27:11.000 --> 27:26.000
Yeah — any questions?

27:42.000 --> 27:53.000
Just to clarify a little bit: from the BPF program, what exactly do you collect? Is it the whole stack trace, is it just the fact that some node in this call graph was hit, or what exactly?

27:53.000 --> 28:07.000
So, basically, yes: just the fact that this specific function — not the system call, but a function inside the kernel — was hit, and also the return value of this specific function, which actually allows us to recreate the picture and understand what happened.

28:08.000 --> 28:25.000
With this system, we collect only selected traces, and then, when we basically understand that, okay, this is a significant issue, we jump in and try to recreate the whole call trace to understand what can basically be the issue.

28:25.000 --> 28:28.000
Thank you.

28:37.000 --> 28:39.000
Thank you.

