WEBVTT

00:00.000 --> 00:10.000
For the next presentation, we've got Sarmat talking about performance and reliability and

00:10.000 --> 00:11.000
analysis.

00:11.000 --> 00:12.000
I hope you feel sorry.

00:12.000 --> 00:13.000
For eBPF.

00:13.000 --> 00:18.000
There you go.

00:18.000 --> 00:27.000
Hello everyone.

00:27.000 --> 00:32.000
Thank you so much for taking out the time to be present at this talk.

00:32.000 --> 00:37.000
My name is Sarmat Socket and I'll be talking about the performance and reliability

00:37.000 --> 00:39.000
pitfalls of eBPF.

00:39.000 --> 00:41.000
So just a little bit about me.

00:41.000 --> 00:43.000
I work as a software engineer at Datadog.

00:43.000 --> 00:48.000
I work on eBPF as part of the eBPF platform team.

00:48.000 --> 00:52.600
I enjoy exploring and working with kernel internals, and that's the direction I'll be

00:52.600 --> 00:55.000
approaching this talk from.

00:55.000 --> 01:00.000
So the way I'll be covering this topic of performance and reliability will be through

01:00.000 --> 01:04.000
three stories about some issues that we ran into.

01:04.000 --> 01:11.000
And hopefully I'll be able to thread a common theme through them and talk about what is shared

01:11.000 --> 01:17.000
between all of these that could be considered the general theme of the problems that we're addressing.

01:17.000 --> 01:22.000
So we're talking about some of the pains with scaling kretprobes that we've experienced.

01:22.000 --> 01:30.000
A deadlock in fentry that we came across, and recently a production incident that we were trying to resolve,

01:30.000 --> 01:33.000
where uprobes were at the center of everything.

01:33.000 --> 01:35.000
Well kind of next to the center.

01:35.000 --> 01:38.000
There was something else in the center.

01:38.000 --> 01:43.000
So, starting with the scaling pains of kretprobes: some time ago,

01:43.000 --> 01:49.000
we got this customer issue where they were complaining that when they install our agent,

01:49.000 --> 01:55.000
suddenly CPU spikes and their network throughput is completely decimated.

01:55.000 --> 02:01.000
The customer was running a pretty old kernel version, 4.14, on Amazon Linux 2.

02:01.000 --> 02:05.000
And they had a very large machine with 96 cores.

02:05.000 --> 02:09.000
So we set up a similar environment,

02:09.000 --> 02:16.000
ran our agent, and started looking at perf to see if we could easily pinpoint where the problem was coming from.

02:16.000 --> 02:18.000
And that turned out to be pretty helpful.

02:18.000 --> 02:27.000
So before I go into what the actual problem was, just to give you a brief overview of how kretprobes function.

02:27.000 --> 02:34.000
So, as mentioned before, the optimized kretprobes are what you commonly see, so that's the scheme I'll be talking about today.

02:35.000 --> 02:40.000
So kretprobes basically instrument a trampoline instruction at the beginning of a function.

02:40.000 --> 02:47.000
This trampoline goes to a trampoline function which is responsible for overriding the return address on the stack.

02:47.000 --> 02:50.000
So that when the function returns you go to your hook.

02:50.000 --> 02:57.000
And it also saves some invocation-specific state in this structure called the kretprobe instance structure.

02:57.000 --> 03:03.000
So some of the state that it saves is the original return address, so you can eventually return back to it.

03:03.000 --> 03:11.000
And the important thing to note here is that these kretprobe instance structs are invocation-specific.

03:11.000 --> 03:15.000
So each invocation of the function gets its own struct.

03:15.000 --> 03:22.000
So these structs are kept in the kretprobe structure as a linked list.

03:22.000 --> 03:26.000
And in order to synchronize access to this linked list,

03:26.000 --> 03:29.000
the kernel, on 4.14, was using a spinlock.

03:29.000 --> 03:36.000
So you can imagine that if you have this massive machine, let's say running a bunch of tcp_recvmsg calls in parallel.

03:36.000 --> 03:41.000
All of these invocations serialize on this spinlock, and that becomes a massive, massive bottleneck.
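
The shared-state design described above can be sketched as a small conceptual model (plain Python, not kernel code; names are illustrative):

```python
import threading

# Conceptual sketch: pre-objpool kretprobes keep per-invocation instance
# structs on lists inside one shared kretprobe struct, guarded by a single
# spinlock, so every concurrent entry/return of the probed function
# serializes on that one lock.
class KretprobeSketch:
    def __init__(self, maxactive):
        self.lock = threading.Lock()               # the single per-kretprobe lock
        self.free_instances = list(range(maxactive))

    def on_function_entry(self):
        with self.lock:                            # every CPU contends here...
            if self.free_instances:
                return self.free_instances.pop()
            return None                            # "missed" invocation

    def on_function_return(self, instance):
        with self.lock:                            # ...and again on return
            self.free_instances.append(instance)

kp = KretprobeSketch(maxactive=4)
inst = kp.on_function_entry()
kp.on_function_return(inst)
```

With one lock guarding the instance list, a 96-core box running the probed function in parallel turns every invocation into a queue on that lock.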

03:41.000 --> 03:49.000
And when we ran perf top, it reflected this observation: we saw a large amount of CPU time being spent in this spinlock.

03:49.000 --> 03:54.000
And that obviously explains the massive drop in throughput.

03:54.000 --> 04:02.000
So unfortunately, since this is a problem (as you'll see later on, all of these were problems with the infrastructure surrounding eBPF),

04:02.000 --> 04:07.000
the only recommendation that we can give is to upgrade your kernel version.

04:07.000 --> 04:12.000
Unfortunately, we pretty soon found that was only enough up to a certain point.

04:12.000 --> 04:15.000
Soon afterwards, we received another customer report.

04:15.000 --> 04:22.000
And this time the customer was running kernel 6.6 again with a very large instance type and again with the same symptom.

04:22.000 --> 04:25.000
Network gets destroyed.

04:25.000 --> 04:28.000
Like the throughput gets destroyed.

04:28.000 --> 04:33.000
So this time around, perf wasn't very helpful in pinpointing anything.

04:33.000 --> 04:36.000
So we started looking into what changes have happened.

04:36.000 --> 04:40.000
in the kernel since then, especially in the kretprobe infrastructure.

04:40.000 --> 04:51.000
And we found that, in order to address that spinlock contention issue, the kernel developers had come up with a solution of moving the kretprobe

04:51.000 --> 04:55.000
infrastructure to this lock-less freelist.

04:55.000 --> 05:00.000
So basically there's no lock, so there should be no contention.

05:00.000 --> 05:06.000
But if you look at the implementation of this freelist, right at the top of the page in the header comment,

05:06.000 --> 05:09.000
you see this very helpful hint as to what the problem could be.

05:09.000 --> 05:13.000
It's not the fastest thing in the world under heavy contention.

05:13.000 --> 05:16.000
And that's precisely the scenario that we were running into.

05:16.000 --> 05:26.000
So what happens over here? You have this lock-less freelist, but it's a shared memory region that is being written to simultaneously by a large number of CPUs.

05:26.000 --> 05:32.000
So memory itself at this point becomes a bottleneck because of excessive cache coherency traffic.

05:32.000 --> 05:36.000
And obviously this gets worse as you add more and more CPUs.

05:36.000 --> 05:42.000
And since we were running on this massive instance, the effect was particularly dramatic.
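
The lock-free structure can be sketched as a Treiber-stack-style push/pop with a compare-and-swap retry loop (a conceptual Python model; the real kernel freelist uses atomic cmpxchg on a single head pointer):

```python
# Conceptual sketch: no lock, but every CPU still updates the one shared
# `head` via compare-and-swap, so the cache line holding it ping-pongs
# between cores under heavy contention.
class Freelist:
    def __init__(self):
        self.head = None                  # the one hot shared location

    def _cas(self, expected, new):
        # Stand-in for an atomic cmpxchg on self.head.
        if self.head is expected:
            self.head = new
            return True
        return False

    def push(self, node):
        while True:
            old = self.head
            node["next"] = old
            if self._cas(old, node):      # retried if another CPU won the race
                return

    def pop(self):
        while True:
            old = self.head
            if old is None:
                return None
            if self._cas(old, old["next"]):
                return old

fl = Freelist()
fl.push({"id": 1, "next": None})
fl.push({"id": 2, "next": None})
```

The point of the sketch: removing the lock removes lock contention, but not the single shared memory location, which is exactly the cache-coherency bottleneck described above.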

05:42.000 --> 05:51.000
So, I realize that I did not attach the perf output in the slide, but ultimately, when we used perf to record some of the

05:51.000 --> 05:56.000
raw hardware performance counters, we in fact saw that we were bottlenecked on memory;

05:56.000 --> 05:59.000
we were spending most of our time waiting for memory to arrive.

05:59.000 --> 06:04.000
So again, the recommendation is: upgrade your kernel. Sorry.

06:05.000 --> 06:15.000
And luckily enough, after kernel 6.6, in fact I think in one of the later releases after 6.6, this problem was ultimately addressed.

06:15.000 --> 06:24.000
And now kretprobes use this objpool mechanism, which uses per-CPU rings, to store the kretprobe instance structures.

06:24.000 --> 06:29.000
And since it's per-CPU, it's very scalable, and you don't run into this problem.
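
The per-CPU idea can be sketched like this (a conceptual Python model of the objpool approach; slot counts and the stealing fallback are illustrative):

```python
# Conceptual sketch: one ring of instance slots per CPU, so the fast path
# touches only CPU-local memory and there is no single shared cache line
# for all cores to fight over.
class ObjpoolSketch:
    def __init__(self, ncpus, slots_per_cpu):
        self.rings = [list(range(slots_per_cpu)) for _ in range(ncpus)]

    def pop(self, cpu):
        if self.rings[cpu]:               # fast path: CPU-local ring
            return self.rings[cpu].pop()
        for ring in self.rings:           # slow path: take from another ring
            if ring:
                return ring.pop()
        return None                       # pool exhausted

    def push(self, cpu, obj):
        self.rings[cpu].append(obj)

pool = ObjpoolSketch(ncpus=2, slots_per_cpu=2)
```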

06:29.000 --> 06:39.000
So while we were investigating these performance characteristics of kretprobes, we also realized that, hey, we don't want to be running into these problems in our infrastructure.

06:39.000 --> 06:44.000
And back then, unfortunately, we were also behind this optimization as well.

06:44.000 --> 06:51.000
So we decided: let's do away with kretprobes, let's move into the modern world, let's use fentry.

06:51.000 --> 07:03.000
We did that, we made the implementation, and then pretty soon we started getting reports of some failures when the agent container was exiting.

07:03.000 --> 07:09.000
So we set up an environment, launched the container, stopped the container, and you see this error:

07:09.000 --> 07:12.000
tried to kill the container, but did not receive an exit event.

07:12.000 --> 07:15.000
Kind of gives you the hint that something's stuck somewhere.

07:15.000 --> 07:24.000
So we used sysrq to dump all tasks in uninterruptible sleep, to see who's the one causing the problem.

07:24.000 --> 07:29.000
And lo and behold, you do have some tasks stuck in uninterruptible sleep.

07:29.000 --> 07:31.000
And in this case, you have two.

07:31.000 --> 07:35.000
So on the left-hand side, you see this task, system-probe.

07:35.000 --> 07:40.000
So system-probe is a component of the agent, which is responsible for managing eBPF and stuff.

07:40.000 --> 07:46.000
And on the right-hand side, you have this rcu_tasks kthread; it's stuck somewhere as well.

07:46.000 --> 07:50.000
So system-probe is trying to detach fentry.

07:50.000 --> 07:56.000
And while it's detaching fentry, it calls this function, synchronize_rcu_tasks.

07:56.000 --> 07:59.000
And once it calls that, it gets put to sleep.

07:59.000 --> 08:04.000
And on the right hand side, you have this kernel thread, which is doing some kind of waiting.

08:04.000 --> 08:10.000
And in the middle of this waiting, it calls this other synchronize RCU function and gets put to sleep.

08:10.000 --> 08:15.000
At this point, you can already see that there's some kind of dependency over here, like system probe is obviously

08:15.000 --> 08:17.000
depending on this kernel thread.

08:17.000 --> 08:21.000
But what's preventing this kernel thread from making forward progress?

08:21.000 --> 08:28.000
So whatever it was, it wasn't showing up in the list of tasks stuck in uninterruptible sleep.

08:28.000 --> 08:31.000
So we say, let's dump everything.

08:31.000 --> 08:37.000
Alright, we pass a different key to sysrq, and the kernel just shows you all the tasks on all CPUs.

08:37.000 --> 08:43.000
And scrolling through that pretty long list, this catches our eye:

08:43.000 --> 08:45.000
s6-svscan.

08:45.000 --> 08:51.000
So s6 is a set of process-management utilities that Docker uses.

08:51.000 --> 08:54.000
And so by the way, this was Docker container.

08:54.000 --> 08:57.000
So process management utilities that Docker uses.

08:57.000 --> 09:03.000
It caught our eye because this is Docker, and we saw the issue happen inside Docker.

09:03.000 --> 09:05.000
And also, this is the parent of system-probe.

09:05.000 --> 09:11.000
Not only is it the parent of system-probe, it's also PID 1 in the PID namespace of the container.

09:11.000 --> 09:17.000
So if you look at what it's up to, it's trying to destroy the PID namespace and as part of it,

09:17.000 --> 09:20.000
it's waiting on something.

09:20.000 --> 09:23.000
So now you have these three entities that in some way seem

09:23.000 --> 09:28.000
to be related to each other, but we don't really know what's precisely causing the problem.

09:28.000 --> 09:38.000
In order to understand that, we need to take a step back and actually see how things work under the hood for each of the different entities that we're trying to understand.

09:38.000 --> 09:40.000
So system-probe is working with fentry.

09:40.000 --> 09:42.000
So let's take a look at what fentry is.

09:42.000 --> 09:48.000
Fentry, very similar to kretprobes, instruments a trampoline instruction at the top of the function.

09:48.000 --> 09:54.000
But unlike kretprobes, where you jump into some shared infrastructure code and you have all of the state-management stuff,

09:54.000 --> 09:56.000
this does something very simple.

09:56.000 --> 09:59.000
It just jumps to a trampoline image.

09:59.000 --> 10:02.000
In the beginning of the trampoline image, you run the fentry program.

10:02.000 --> 10:06.000
You call the function, and if you have an fexit program, you call that.

10:06.000 --> 10:08.000
It's very simple.

10:08.000 --> 10:11.000
The trampoline image is generated at load time.

10:11.000 --> 10:14.000
So there's literally a stub associated with each trampoline.

10:14.000 --> 10:18.000
So this scales very well as well, because there's no shared state, no coordination problem.

10:18.000 --> 10:23.000
to solve, just read-only memory. Amazing.
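
The per-probe stub idea can be sketched with a closure (a conceptual Python model; the real trampoline is generated machine code, and the names here are illustrative):

```python
# Conceptual sketch of why fentry scales: at attach time a dedicated stub
# is generated per probed function. An invocation just runs the attached
# program(s), then the real function, touching only probe-local state.
def make_trampoline(target_fn, fentry_progs):
    def trampoline(*args):
        for prog in fentry_progs:     # run the fentry programs first
            prog(*args)
        return target_fn(*args)       # then call the real function
    return trampoline

calls = []
traced_add = make_trampoline(lambda a, b: a + b,
                             [lambda a, b: calls.append((a, b))])
```

Each probed function gets its own stub, so concurrent invocations never share mutable state.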

10:23.000 --> 10:27.000
Now, on the other side, so that was system-probe doing fentry stuff.

10:27.000 --> 10:32.000
On the other side, we had this rcu_tasks kthread.

10:32.000 --> 10:33.000
What is RCU tasks?

10:33.000 --> 10:36.000
So this is a subsystem of the Linux kernel.

10:36.000 --> 10:41.000
I don't know if we can call it a subsystem, but it's a different RCU flavor.

10:41.000 --> 10:48.000
And the facility that it provides to the kernel is to basically allow it to figure out

10:48.000 --> 10:52.000
when no process is holding a reference to some structure.

10:52.000 --> 10:56.000
So the way normal RCU works is also similar conceptually,

10:56.000 --> 11:00.000
but the problem it's trying to solve is to provide the kernel with a mechanism

11:00.000 --> 11:04.000
to figure out when no CPU is holding a reference to a structure.

11:04.000 --> 11:09.000
So there are some invariants that the code has to observe when it's using normal RCU,

11:09.000 --> 11:15.000
which is that only one process on a CPU can hold a reference at a time.

11:15.000 --> 11:22.000
However, with RCU Tasks, multiple processes on a CPU can hold a reference to some structure.

11:22.000 --> 11:31.000
And the way the waiting for the release of the structure happens is that, before the start of the wait,

11:31.000 --> 11:36.000
all of the tasks that were running need to have performed a voluntary context switch

11:36.000 --> 11:39.000
before the wait ends.

11:39.000 --> 11:42.000
So not preemption, but a voluntary context switch.

11:42.000 --> 11:44.000
So how does this work with fentry?

11:44.000 --> 11:46.000
Let's say you have this trampoline image.

11:46.000 --> 11:48.000
Similar to that tcp_recvmsg

11:48.000 --> 11:54.000
example: a bunch of parallel executions where a lot of tasks are executing in this code region.

11:54.000 --> 11:58.000
Let's say something happens, like you need to update that trampoline image,

11:58.000 --> 12:02.000
what the kernel will do is it will generate the new trampoline image,

12:02.000 --> 12:04.000
point the trampoline to that.

12:04.000 --> 12:10.000
So that all subsequent executions start executing in this new stub.

12:10.000 --> 12:14.000
But now the problem that the kernel needs to solve is how does it know that the old

12:14.000 --> 12:17.000
trampoline image is safe to release?

12:17.000 --> 12:22.000
So what could actually happen is, let's say, some task is executing in this old

12:22.000 --> 12:24.000
trampoline image, and it gets preempted.

12:24.000 --> 12:28.000
And by the time it comes back online, the kernel may have deleted the

12:28.000 --> 12:29.000
trampoline image from underneath it.

12:29.000 --> 12:33.000
And suddenly it's executing junk instructions and you get a crash.

12:33.000 --> 12:37.000
That's not what we want, so this is where RCU tasks come in.

12:37.000 --> 12:47.000
As I said, RCU tasks allow the kernel to figure out when no process is holding a reference to some structure.

12:47.000 --> 12:49.000
And how does it do that?

12:49.000 --> 12:54.000
So basically the algorithm broadly is comprised of three parts.

12:54.000 --> 12:59.000
The first is that it scans the task list for all runnable tasks and

12:59.000 --> 13:01.000
adds them to this holdout list.

13:01.000 --> 13:07.000
The holdout list is all the tasks which have not yet performed a voluntary context switch.

13:07.000 --> 13:13.000
So the second part, which is highlighted because it is involved in the bug, is that it

13:13.000 --> 13:16.000
also has to synchronize with late-stage do_exit.

13:16.000 --> 13:21.000
And the reason it needs to do that has to do with late-stage do_exit.

13:21.000 --> 13:26.000
A task executing in late-stage do_exit is no longer on the task list.

13:26.000 --> 13:31.000
So the scan of the task list doesn't make that late-stage do_exit task

13:31.000 --> 13:33.000
visible to the RCU Tasks subsystem.

13:33.000 --> 13:40.000
So what can happen here, if there's no synchronization, is this: you have a probe somewhere in this late-stage do_exit path.

13:40.000 --> 13:43.000
The task gets preempted.

13:43.000 --> 13:47.000
The RCU Tasks subsystem does not see it, so the kernel removes the trampoline image.

13:47.000 --> 13:50.000
The task comes back, executes junk instructions, and it dies.

13:50.000 --> 13:58.000
So in order to prevent that, you need to synchronize with this stage, and to do this the kernel brings in another RCU variant.

13:58.000 --> 14:01.000
This is the SRCU subsystem.

14:01.000 --> 14:09.000
And basically the idea here is that the RCU Tasks subsystem waits until all tasks that were performing an exit

14:09.000 --> 14:14.000
when the wait started have finished the exit by the time the wait ends.

14:15.000 --> 14:21.000
And it uses this tasks_rcu_exit_srcu lock to do that.

14:21.000 --> 14:28.000
Finally, the kernel loops over the holdout list until all tasks have been seen performing a voluntary context switch.

14:28.000 --> 14:33.000
And then it's safe to delete what you were trying to delete.

14:33.000 --> 14:37.000
Because at that point you know that nobody has a reference to it.
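The three parts of the grace-period algorithm can be sketched like this (a conceptual Python model; task objects are plain dicts and the field names are illustrative, not kernel fields):

```python
# Conceptual sketch: build a holdout list of runnable tasks, synchronize
# with tasks in late-stage do_exit, then drain the holdout list as each
# task is observed doing a voluntary context switch.
def rcu_tasks_grace_period(tasks, wait_for_exiting_tasks):
    # Part 1: scan the task list, collect runnable tasks as holdouts.
    holdouts = [t for t in tasks if t["runnable"]]
    # Part 2: synchronize with late-stage do_exit (the SRCU-based wait).
    wait_for_exiting_tasks()
    # Part 3: loop until every holdout has voluntarily switched.
    while holdouts:
        holdouts = [t for t in holdouts if not t["voluntarily_switched"]]
        for t in holdouts:
            t["voluntarily_switched"] = True   # simulate tasks making progress

tasks = [{"runnable": True, "voluntarily_switched": False},
         {"runnable": False, "voluntarily_switched": False}]
rcu_tasks_grace_period(tasks, wait_for_exiting_tasks=lambda: None)
```

The bug discussed next lives entirely in part 2: the wait for exiting tasks can itself be blocked by a task that is waiting on the caller.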

14:37.000 --> 14:43.000
So one crucial thing is that this wasn't part of the original RCU Tasks implementation, as far as I can tell.

14:43.000 --> 14:46.000
At least as far as I could see from the kernel history.

14:46.000 --> 14:53.000
This was explicitly added to resolve this problem of coordination with late-stage do_exit.

14:53.000 --> 14:56.000
So finally, we know what system-probe is up to.

14:56.000 --> 14:58.000
It's doing some fentry stuff.

14:58.000 --> 15:01.000
We know how this rcu_tasks kthread is working.

15:01.000 --> 15:04.000
And the final entity that we were dealing with was s6-svscan.

15:04.000 --> 15:06.000
So the zapper for the PID namespace.

15:06.000 --> 15:10.000
What it does is straightforward: it sends a kill signal to all the children.

15:10.000 --> 15:12.000
And waits until they're dead.

15:12.000 --> 15:19.000
But crucially, it does this waiting inside of the tasks_rcu_exit_srcu (that's a mouthful)

15:19.000 --> 15:21.000
read-side critical section.

15:21.000 --> 15:27.000
So now we kind of have a more precise picture forming of what the dependencies are.

15:27.000 --> 15:31.000
So we have system-probe, which calls synchronize_rcu_tasks.

15:31.000 --> 15:35.000
And it's waiting on the rcu_tasks kthread to make forward progress.

15:35.000 --> 15:40.000
The rcu_tasks kthread adds all runnable tasks to the holdout list.

15:40.000 --> 15:47.000
It calls synchronize_srcu to wait for tasks in late-stage do_exit to finish.

15:47.000 --> 15:53.000
And we have this s6-svscan, which is also in late-stage do_exit.

15:53.000 --> 15:57.000
This is waiting for all processes in its PID namespace to exit.

15:57.000 --> 16:02.000
And it does this waiting inside of the SRCU read-side critical section.

16:02.000 --> 16:05.000
So now you see you have the circular dependency.

16:05.000 --> 16:08.000
One task is waiting on another to make forward progress.

16:08.000 --> 16:11.000
One task is waiting for the other one to make forward progress.

16:11.000 --> 16:14.000
And none of them can because this is a cycle.

16:14.000 --> 16:16.000
And that was the bug.

16:16.000 --> 16:26.000
So unfortunately, the fix for this came in pretty late relative to the kernel versions from that time.

16:26.000 --> 16:28.000
So in kernel 6.9.

16:28.000 --> 16:36.000
The author of this patch removed this SRCU mechanism for coordination entirely, and came up with a different approach,

16:36.000 --> 16:39.000
So that this problem no longer exists.

16:39.000 --> 16:43.000
And when we were trying to port over to fentry, this patch had not yet been backported.

16:43.000 --> 16:45.000
But at this point it's been backported.

16:45.000 --> 16:49.000
So it's safe to use fentry; I don't know how far back it's been backported.

16:49.000 --> 16:52.000
It's generally safe to use fentry.

16:52.000 --> 16:56.000
From this thread, we also found another investigation into this issue.

16:56.000 --> 17:04.000
In this case, the author ran into this while dealing with PID namespaces.

17:07.000 --> 17:14.000
But as far as I know, I haven't seen any documentation of anybody running into this problem with fentry before.

17:14.000 --> 17:19.000
So that was our foray into trying to scale kretprobes.

17:19.000 --> 17:23.000
We tried fentry, and failed because of a bug.

17:23.000 --> 17:31.000
And now we can switch gears a little bit and go to this other subsystem of the Linux kernel, which is the uprobe subsystem.

17:31.000 --> 17:38.000
And we recently were trying to resolve this production incident where we were getting reports of service,

17:38.000 --> 17:42.000
unavailability on a few nodes in production.

17:42.000 --> 17:53.000
And so eventually somebody figured out that khungtaskd was reporting that a bunch of tasks were just stuck.

17:53.000 --> 17:56.000
They were waiting for more than two minutes to proceed.

17:56.000 --> 17:59.000
Some of these tasks were specific to the production application.

17:59.000 --> 18:03.000
Some of them were just random system tasks like bash, systemd, runc, etc.

18:03.000 --> 18:08.000
And we got pulled into this because, amongst the stack traces that khungtaskd was reporting,

18:08.000 --> 18:12.000
there were stack traces pointing to uprobe registration and uprobe firing.

18:12.000 --> 18:19.000
So we started investigating, looking into whether the agent could somehow be triggering an issue.

18:19.000 --> 18:27.000
So we started the investigation, going through each hypothesis. Is it a deadlock again?

18:27.000 --> 18:32.000
We saw an almost-circular lock dependency, but not quite.

18:32.000 --> 18:34.000
So we could not conclude it was a deadlock.

18:34.000 --> 18:42.000
Is it a memory corruption? Because we found a patch that the kernel we were running was missing.

18:42.000 --> 18:45.000
And it was talking about a memory corruption in a struct.

18:45.000 --> 18:53.000
So we built a custom kernel with some modifications to trigger this corruption deterministically, to see what happens.

18:53.000 --> 18:58.000
But the symptoms that we saw in that case did not match with the symptoms that we were seeing in production.

18:58.000 --> 19:00.000
So not that.

19:00.000 --> 19:02.000
And then we had this wild hypothesis.

19:02.000 --> 19:06.000
Some kind of livelock due to memory pressure: the kernel is trying to allocate memory,

19:06.000 --> 19:09.000
but due to memory pressure it's failing, and it's looping somehow.

19:09.000 --> 19:16.000
But we evaluated all page-allocation paths in uprobe registration and uprobe firing, and this did not seem possible.

19:17.000 --> 19:24.000
Then we had a lucky break. In order to test this deadlock hypothesis,

19:24.000 --> 19:37.000
we had built this very simple eBPF-based utility that leveraged the contention_begin and contention_end tracepoints to basically track lock contention in the kernel.
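
The user-space side of such a tool might do bookkeeping like the following (a conceptual sketch; the record format here is simplified from what the lock:contention_begin / lock:contention_end tracepoints actually emit):

```python
# Sketch: given (timestamp_ns, task, event) records for one lock, compute
# per-task wait durations and the peak number of simultaneous waiters.
def analyze_contention(events):
    begin_ts = {}
    waiting = 0
    max_depth = 0
    durations = {}
    for ts, task, kind in sorted(events):
        if kind == "begin":               # task started waiting on the lock
            begin_ts[task] = ts
            waiting += 1
            max_depth = max(max_depth, waiting)
        else:                             # "end": the task acquired the lock
            durations[task] = ts - begin_ts.pop(task)
            waiting -= 1
    return durations, max_depth

events = [(100, "A", "begin"), (110, "B", "begin"),
          (150, "A", "end"), (200, "B", "end")]
```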

19:37.000 --> 19:44.000
And we stumbled across something that gave us the hint as to what could be going wrong.

19:44.000 --> 19:49.000
So in order to kind of set the stage for that.

19:49.000 --> 19:55.000
I was going to talk a little bit about how uprobe registration works, but I'll skip this because it's not really that important.

19:55.000 --> 19:59.000
I'll go over to this one which is more relevant.

19:59.000 --> 20:03.000
When the uprobe registration code path is running, you take a bunch of locks.

20:03.000 --> 20:08.000
So you take the register_rwsem lock on the uprobe that you're dealing with.

20:08.000 --> 20:13.000
You take this interesting dup_mmap_sem lock.

20:13.000 --> 20:17.000
This lock is very interesting because it synchronizes

20:17.000 --> 20:21.000
uprobe registration with forks, and the problem it is trying to solve is this:

20:21.000 --> 20:25.000
while a uprobe registration is happening, if a fork happens simultaneously,

20:25.000 --> 20:31.000
a child mm may inherit the debug trap that uprobes put in,

20:31.000 --> 20:33.000
without an associated uprobe.

20:33.000 --> 20:40.000
So this has to synchronize the uprobe registration with the creation of the child mm.

20:40.000 --> 20:48.000
And then finally, basically what this was showing is that, using the inode of the file,

20:48.000 --> 20:52.000
the kernel discovers all of the mm_structs.

20:52.000 --> 20:56.000
That is, all of the address spaces that are mapping the file that we're trying to probe.

20:56.000 --> 20:59.000
It takes a write lock on each one, does some work, and unlocks it.

20:59.000 --> 21:05.000
So the reason it takes a write lock on the mm_struct is to synchronize with uprobe firing.

21:05.000 --> 21:10.000
So the problem it's trying to solve here is again with unregistration:

21:10.000 --> 21:15.000
you don't want the uprobe-firing path to discover a uprobe that you have

21:15.000 --> 21:20.000
unregistered since then, such that it runs and doesn't find

21:20.000 --> 21:23.000
the uprobe down the line, and returns an error.

21:23.000 --> 21:27.000
So you need to have synchronization going on over there.
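
The registration-path lock ordering just described can be sketched like this (a conceptual Python model; the lock names follow the kernel, register_rwsem, dup_mmap_sem, and the per-mm mmap lock, but these are plain mutexes standing in for rwsems):

```python
import threading

class MM:
    def __init__(self):
        self.mmap_lock = threading.Lock()   # write side of mm->mmap_lock
        self.breakpoints = []

register_rwsem = threading.Lock()
dup_mmap_sem = threading.Lock()

def register_uprobe_sketch(mms, addr):
    with register_rwsem:            # serialize register/unregister of this uprobe
        with dup_mmap_sem:          # keep fork() from copying a stray breakpoint
            for mm in mms:          # every mm mapping the probed inode
                with mm.mmap_lock:  # synchronize with the uprobe-firing path
                    mm.breakpoints.append(addr)

mms = [MM(), MM()]
register_uprobe_sketch(mms, addr=0x1234)
```

Note the nesting: a registration that stalls on any one mm's mmap lock does so while still holding dup_mmap_sem, which matters for the incident described below.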

21:27.000 --> 21:32.000
Essentially, we tried to graph what was going on in the uprobe registration path.

21:32.000 --> 21:38.000
And you see that in system-probe one execution is going down this path, where you see a bunch of locks get taken.

21:38.000 --> 21:47.000
And in every single case, we were seeing it just get blocked trying to acquire a write lock on the mmap lock.

21:47.000 --> 21:58.000
So, going over to uprobe firing: the important thing here is that it works with the address space of the process in which the uprobe is firing.

21:58.000 --> 22:01.000
And it takes a read lock on it.

22:01.000 --> 22:07.000
And in the case of uprobe firing, we were seeing a large number of tasks blocked in find_active_uprobe.

22:07.000 --> 22:10.000
And this is the function where you take the read lock on the mm_struct.

22:10.000 --> 22:14.000
And again, it was blocked trying to acquire this read lock.

22:14.000 --> 22:20.000
And lots of application threads were piling up waiting for this lock.

22:20.000 --> 22:22.000
So what is this mmap lock?

22:22.000 --> 22:27.000
Well, it's this read-write semaphore: it allows concurrent readers but exclusive writers.

22:27.000 --> 22:34.000
So we modified the eBPF utility that we had been talking about, to try to get some questions answered about what's going on.

22:34.000 --> 22:40.000
The important thing here is: what is the depth of the wait queue when these issues happen?

22:40.000 --> 22:46.000
So the wait queue is the tasks waiting to acquire the semaphore.

22:46.000 --> 22:50.000
What's the contention duration? Like how long is the contention going on for?

22:50.000 --> 22:52.000
And whether forward progress is always made.

22:52.000 --> 23:00.000
This was added to rule out deadlocks. And we figured out that even though things were waiting for two-plus minutes, they were making forward progress.

23:00.000 --> 23:06.000
So eventually what we saw was that the wait queue became very, very deep.

23:06.000 --> 23:12.000
And the pattern was that there were lots of readers with a few writers interspersed everywhere.

23:12.000 --> 23:19.000
And these writers were coming from mmap and madvise operations in the task, and from uprobe registration.

23:19.000 --> 23:31.000
So the way read-write semaphores work is that every time a writer gives up the lock, it wakes up the readers in the wait queue.

23:31.000 --> 23:37.000
But it only wakes up 256 readers at a time, to avoid a thundering-herd situation.

23:37.000 --> 23:42.000
So this means that the wait queue is broken up, and batches of readers become serialized.

23:42.000 --> 23:53.000
And not only that: in order to prevent writer starvation, if there's ever a writer at the head of the wait queue, no new readers are allowed in.

23:53.000 --> 23:57.000
So this creates a situation where readers start to pile up.

23:57.000 --> 24:04.000
And if there are writers interspersed here and there, you have these readers serialized in batches.

24:04.000 --> 24:09.000
And then eventually a writer makes it to the head of the wait queue, and suddenly no more readers are admitted.

24:09.000 --> 24:15.000
And if you have a uprobe in the hot path, a lot of readers are suddenly collecting.
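
The batching behavior just described can be sketched as follows (a conceptual Python model of the wakeup policy; 256 matches the batch limit in the kernel's rwsem code, and the queue drain is simplified):

```python
# Sketch: when the lock is released, waiting readers are woken in batches,
# and a writer at the head of the queue is granted the lock alone, so
# interspersed writers chop a deep reader queue into serialized batches.
MAX_READERS_WAKEUP = 256

def wakeup_batches(wait_queue):
    batches = []
    q = list(wait_queue)
    while q:
        if q[0] == "W":                     # writer at the head goes alone,
            batches.append([q.pop(0)])      # blocking the readers behind it
        else:
            batch = []
            while q and q[0] == "R" and len(batch) < MAX_READERS_WAKEUP:
                batch.append(q.pop(0))
            batches.append(batch)
    return batches

# 300 readers, one writer, 300 more readers: five serialized batches
# (256 + 44 readers, then the writer, then 256 + 44 readers again).
queue = ["R"] * 300 + ["W"] + ["R"] * 300
```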

24:15.000 --> 24:24.000
So this diagram is supposed to explain why we were seeing weird tasks like bash and runc getting stuck.

24:24.000 --> 24:33.000
So, as I mentioned, in the uprobe registration path, let's imagine you're waiting for two minutes to take this mmap lock because the wait queue is so deep.

24:33.000 --> 24:37.000
But note that you've already taken this dup_mmap_sem, which synchronizes with fork.

24:37.000 --> 24:43.000
So suddenly, all forks in the system are waiting on you to get this mmap write lock.

24:43.000 --> 24:48.000
So if you SSH in, a bash comes up, it tries to fork, and nothing happens.

24:48.000 --> 24:53.000
systemd can't do anything, runc can't do anything; your system is pretty much bricked.

24:53.000 --> 25:01.000
And in the case of uprobe firing, it's blocked on getting a read lock, but it's still blocked.

25:01.000 --> 25:09.000
In this case, application performance suffers, because application threads just start to pile up in the kernel and they can't make forward progress.

25:09.000 --> 25:15.000
So bad things happen: the application stalls, there's unavailability, etc.

25:15.000 --> 25:28.000
So eventually there was a fix in the kernel, and right now, in the very latest versions, I think they have this speculative-acquisition approach for the mmap lock.

25:28.000 --> 25:36.000
So they don't take it unnecessarily; they only take it if they see that something has changed underneath them, and then they repeat the operation with the lock in place.

25:36.000 --> 25:40.000
So this really improves the scalability of the operation.

25:40.000 --> 25:50.000
So basically the common thread that I was trying to get to was that concurrency is very hard.

25:50.000 --> 26:00.000
And sometimes, even though eBPF is meant to be very usable and simple, the complexity of the kernel leaks into it, especially when you're trying to do things at scale.

26:00.000 --> 26:10.000
And sometimes you have to go beyond the eBPF program to figure out where some problem may be coming from.

26:10.000 --> 26:16.000
So thank you, thank you for taking the time to listen.

26:16.000 --> 26:20.000
If you have time for questions, I'll be back.

26:20.000 --> 26:24.000
Yeah, we do have time for a couple of questions.

26:24.000 --> 26:26.000
Thank you.

26:26.000 --> 26:28.000
Just curious about the preempt.

26:28.000 --> 26:38.000
If you could talk a little bit more on that, I'm not sure if I quite got the relevance of the preempt on the slide you were discussing.

26:38.000 --> 26:48.000
So in this last part.

26:48.000 --> 26:49.000
Over here.

26:49.000 --> 27:07.000
So basically, what it's trying to solve is that in certain configurations of the kernel, especially when the kernel can be preempted, what may happen is that while a task is executing kernel code, it can get scheduled out.

27:07.000 --> 27:13.000
So imagine that a process gets scheduled out while it's executing the fentry mechanism.

27:13.000 --> 27:19.000
It technically still has a reference to the trampoline image, but the normal RCU cannot track that.

27:19.000 --> 27:22.000
Well, in some cases it can.

27:22.000 --> 27:36.000
So what you're trying to get assurance of is that the task that was executing fentry has performed a voluntary context switch, because at that point you know it has completed the fentry trampoline image

27:36.000 --> 27:42.000
And given up execution, because it's done with it.

27:42.000 --> 27:44.000
Thank you.

27:44.000 --> 27:48.000
One last question.

27:48.000 --> 27:54.000
I guess the other conclusion is to always use the latest kernel.

27:54.000 --> 27:59.000
The other question is also fexit-related.

27:59.000 --> 28:04.000
Did it also expose the bug, or was it really only specific to fentry?

28:04.000 --> 28:07.000
It's the whole fentry infrastructure that's affected.

28:07.000 --> 28:08.000
Okay.

28:08.000 --> 28:12.000
Because if you look at the diagrams over here, it's the trampoline.

28:12.000 --> 28:13.000
Yeah, infrastructure that is.

28:13.000 --> 28:14.000
I get, yeah.

28:14.000 --> 28:15.000
I agree.

28:18.000 --> 28:19.000
Thank you.

28:22.000 --> 28:23.000
Thank you.

28:23.000 --> 28:24.000
Thank you.

