WEBVTT

00:00.000 --> 00:11.000
All right, for the next presentation, we've got Vladimirov talking about lightweight XDP

00:11.000 --> 00:12.000
profiling.

00:12.000 --> 00:14.000
So take it away.

00:14.000 --> 00:15.000
Hello, everyone.

00:15.000 --> 00:16.000
Can you hear me?

00:16.000 --> 00:17.000
Yes.

00:17.000 --> 00:18.000
Yes.

00:18.000 --> 00:19.000
Perfect.

00:19.000 --> 00:20.000
I'm Vladimirovaskali.

00:20.000 --> 00:22.000
I'm a PhD student at Sapienza University.

00:22.000 --> 00:26.000
And I'm actually here presenting the work that you see,

00:26.000 --> 00:31.000
done with Professor Tom Barbette. And speaking of Professor Barbette,

00:31.000 --> 00:38.000
he asked me to mention that he actually has an open postdoc position.

00:38.000 --> 00:44.000
So if you're interested, if you do networking, SmartNICs, and high-speed packet processing, please do contact him.

00:44.000 --> 00:51.000
Otherwise, I'm here to talk about Inspect, which is a lightweight profiling system for

00:51.000 --> 00:56.000
XDP applications, which outperforms, in both efficiency and profiling accuracy, existing

00:56.000 --> 01:00.000
eBPF profilers.

01:00.000 --> 01:10.000
While eBPF is often used to monitor other programs, and for kernel tracing and debugging, the suite of tools capable of

01:10.000 --> 01:16.000
profiling the kernel part of an eBPF application is limited, with perf and bpftool

01:16.000 --> 01:19.000
playing a key role among them.

01:19.000 --> 01:26.000
So we evaluated five different XDP applications with and without profilers attached, and measured their packet

01:26.000 --> 01:28.000
processing rate.

01:28.000 --> 01:34.000
And as we can see, most prominently in the drop application, which is a simple application

01:34.000 --> 01:36.000
simply dropping every packet:

01:36.000 --> 01:48.000
the throughput goes from 15 million packets per second to less than 4 million, which is a steep drop in performance.

01:48.000 --> 01:56.000
This overhead can also make it hard to profile fast network functions, not only dummy ones

01:56.000 --> 02:05.000
like drop, but also CMS, NAT, and others, which become more challenging to profile.

02:05.000 --> 02:07.000
But how do profilers work?

02:07.000 --> 02:13.000
They rely on a set of specialized hardware registers called performance monitoring counters (PMCs),

02:14.000 --> 02:21.000
to track data about the different hardware events happening in the system,

02:21.000 --> 02:27.000
most prominently instructions, cycles, cache hits and misses, and many others.

02:27.000 --> 02:33.000
To profile the kernel part of an eBPF application, perf and bpftool attach two programs,

02:33.000 --> 02:40.000
an fentry and an fexit, around the target program, to read the PMC values before and after;

02:40.000 --> 02:46.000
then they compute the difference and store the results for user space to be able to read them.

02:46.000 --> 02:53.000
These fentry and fexit functions, although very fast, and probably the fastest way to

02:53.000 --> 02:59.000
wrap around a program, are still quite computationally expensive,

02:59.000 --> 03:05.000
and they introduce a significant overhead, as we saw before.

03:05.000 --> 03:15.000
This profiling overhead comes mainly from the

03:15.000 --> 03:23.000
fentry and fexit functions, but also from the bpf_perf_event_read_value helper,

03:23.000 --> 03:31.000
which is the one used by perf to gather data from the PMCs, and which is pretty slow.

03:32.000 --> 03:38.000
So these profiling functions get called, for an XDP application, millions of

03:38.000 --> 03:46.000
times per second, which ends up drastically disrupting the throughput.

03:46.000 --> 03:54.000
Not only is the throughput disrupted, but also the profiling accuracy. As we said, the

03:55.000 --> 04:00.000
drop application should execute only around two instructions, one setting the action to drop and a

04:00.000 --> 04:12.000
return, maybe a bit more; but perf, as the result of analyzing the drop program,

04:12.000 --> 04:21.000
says that this program has executed 627 instructions. This is because the perf profiler

04:21.000 --> 04:27.000
also counts some of its own instructions, used to profile the program before calling the

04:27.000 --> 04:33.000
read-value helper the second time. So computing the difference picks up some of perf's

04:33.000 --> 04:39.000
own instructions instead. To solve this problem we developed Inspect, a lightweight XDP

04:39.000 --> 04:46.000
profiler, which has three main components: a user-space component that does the setup and

04:47.000 --> 04:55.000
tells the kernel part which CPU events to record; two tracing macros to delimit

04:55.000 --> 05:02.000
the profiling section, the part that you are actually interested in; and a kernel module

05:02.000 --> 05:09.000
to read the PMC values efficiently, called from the tracing macros instead of using the

05:09.000 --> 05:16.000
perf helper. So these are the tracing macros, start_trace and end_trace. They can be

05:16.000 --> 05:24.000
placed anywhere inside an XDP program, wherever it can be useful, and can be used

05:24.000 --> 05:32.000
to profile blocks of code or even individual instructions. Both

05:33.000 --> 05:39.000
macros read the PMC values and store them, but start_trace also manages

05:39.000 --> 05:44.000
an activation flag, if you actually want to profile the selected region, and the sampling rate that

05:44.000 --> 05:50.000
we will talk about in a few minutes. end_trace instead computes the difference with the

05:50.000 --> 05:58.000
values from start_trace and stores the results in a BPF map for user space. But how do we

05:58.000 --> 06:07.000
access these values? eBPF has a limited instruction set, and does not allow the use of native

06:07.000 --> 06:14.000
x86 instructions such as RDPMC, which are needed to read the PMC values very

06:14.000 --> 06:22.000
fast and efficiently. This is why perf has to use the bpf_perf_event_read_value helper

06:22.000 --> 06:29.000
to access them. To overcome this limitation, we developed a Linux kernel module

06:29.000 --> 06:37.000
that exposes a kfunc that invokes this RDPMC instruction and returns the value from

06:37.000 --> 06:50.000
the counter itself. So by calling the macros directly from the XDP program, we

06:50.000 --> 06:57.000
remove the need for fentry and fexit, which saves us about 200 instructions for each

06:57.000 --> 07:06.000
call. Then directly accessing the PMCs through the kfunc saves us another 200-ish

07:06.000 --> 07:13.000
instructions, because we are not calling bpf_perf_event_read_value. As we can see in

07:13.000 --> 07:18.000
this graph, on the left we have the regular perf profiling; on the right one we have ours,

07:19.000 --> 07:24.000
where we still have around 40 instructions for each macro, which are mainly due to the

07:24.000 --> 07:31.000
call to the kfunc. These are some of the programs we used to evaluate our

07:31.000 --> 07:36.000
profiler. drop is the dummy application that simply drops everything; counting

07:36.000 --> 07:43.000
sketch is used for monitoring flow traffic and stores data inside a pretty big map;

07:44.000 --> 07:50.000
NAT translates IPs, and is pretty similar to tunnel, which does some

07:50.000 --> 07:56.000
IP encapsulation; and router, which simply looks up

07:56.000 --> 08:03.000
in a pretty huge LPM trie to access the routing information.

08:03.000 --> 08:12.000
So as I told you a little bit before, there is some profiling inaccuracy when the profiler

08:12.000 --> 08:17.000
is actually pretty heavy, like perf and bpftool, because they add some of their own

08:17.000 --> 08:22.000
instructions to the instruction count. Here in this graph we can see the retired

08:22.000 --> 08:29.000
instructions that each profiler says the application is composed

08:29.000 --> 08:38.000
of, or is actually running. They should all be the same, but they are clearly not,

08:38.000 --> 08:44.000
because perf and bpftool are pretty heavy, and add more noise to the profiling

08:44.000 --> 08:49.000
results. For example, take the drop application, the simple one that we

08:49.000 --> 08:56.000
showed you before: we expect around 2, perf says around 600, and we say about 40, so we are not

08:56.000 --> 09:04.000
perfect ourselves. This inconsistency and inaccuracy also happens

09:05.000 --> 09:11.000
while profiling other metrics, such as cache misses, hits, whatever, because it can happen

09:11.000 --> 09:17.000
that these events are caused by the profiler itself, not by the application under

09:17.000 --> 09:25.000
test. These are the results for the throughput while our applications, the

09:25.000 --> 09:31.000
various applications, are attached to different profilers, compared to the baseline.

09:31.000 --> 09:39.000
Here we can see that our profiler, Inspect, is quite a bit better than the

09:39.000 --> 09:47.000
other profilers, but still hurts the performance of most XDP applications.

09:47.000 --> 09:53.000
The worst case happens in drop because, since it is a very lightweight application,

09:53.000 --> 09:59.000
the weight of the profiler is proportionally high. To mitigate this problem, we implemented

09:59.000 --> 10:08.000
a sampling functionality that increases performance while maintaining good

10:08.000 --> 10:18.000
enough results, as we'll see. Our sampling mechanism is composed of a simple counter

10:18.000 --> 10:24.000
that checks if the packet is in the actual sampling period. If it's in the sampling period,

10:25.000 --> 10:32.000
we read the PMCs, compute the results, store them, and do all the regular stuff.

10:32.000 --> 10:38.000
To compare somewhat fairly against perf, we also implemented similar functionality

10:38.000 --> 10:47.000
inside perf. This sampling functionality was not supported, and is still not supported,

10:47.000 --> 10:56.000
at the time of doing this work, in perf stat.

10:56.000 --> 11:03.000
However, the performance gains for perf are limited, because it still has to

11:03.000 --> 11:09.000
call the fentry to check if the packet is in the sampling period.

11:09.000 --> 11:15.000
This call to the fentry is still pretty expensive, even if you are not calling the perf

11:15.000 --> 11:24.000
read-value helper to get the values. Instead, since we do this sampling inside the macro,

11:24.000 --> 11:29.000
which is pretty lightweight and pretty fast, we can have some better results.

11:29.000 --> 11:37.000
We almost reach the baseline with no profiler attached. If we sample every 64 packets,

11:38.000 --> 11:49.000
let's say, we reach almost the performance of a non-profiled application while maintaining good enough results

11:49.000 --> 12:00.000
and accuracy. In this case, we are counting L1 cache misses, and we get good and expected results.

12:00.000 --> 12:08.000
So, to recap a little bit: during our work, we identified the main sources of overhead of the existing profilers,

12:08.000 --> 12:14.000
which turned out to be the fentry, the fexit, and the perf read-value helper.

12:14.000 --> 12:24.000
Now, these functions are necessary if you actually want to profile the application as-is,

12:24.000 --> 12:29.000
without modifying it, because you are wrapping around the application itself.

12:29.000 --> 12:40.000
If you can modify it, or you prefer better performance, you can skip most of these functions by

12:40.000 --> 12:49.000
calling the profiling itself from inside the XDP program, and using a kfunc to access the PMCs more efficiently;

12:49.000 --> 12:55.000
and then we also implemented the sampling functionality to further reduce overhead.

12:55.000 --> 13:04.000
So, it turns out that Inspect, against, in this case, perf, is 71% faster without any sampling functionality,

13:04.000 --> 13:12.000
and 122% faster with sampling, sampling against sampling obviously.

13:12.000 --> 13:24.000
But more importantly, we get 73% less instruction noise, at least while doing the test with instructions, which is pretty good.

13:24.000 --> 13:29.000
So, thank you very much for the attention, and that would be about it.

13:29.000 --> 13:34.000
Thank you.

13:34.000 --> 13:37.000
We've got time for questions.

13:37.000 --> 13:39.000
Thank you for the talk, very interesting project.

13:39.000 --> 13:47.000
So my question is, if you can do the profiling mechanism like this for XDP programs,

13:48.000 --> 13:51.000
can you adapt this to regular BPF programs?

13:51.000 --> 13:55.000
So, is the overhead basically the same or very different?

13:55.000 --> 14:00.000
Supposedly, we could, because the point is being able to call that kfunc, which

14:00.000 --> 14:04.000
we should be able to call from every BPF program.

14:04.000 --> 14:11.000
The thing is that the performance gains are much higher in XDP,

14:11.000 --> 14:15.000
because the application gets called millions of times per second.

14:15.000 --> 14:22.000
So, a different BPF application might not get the same benefits,

14:22.000 --> 14:29.000
because probably it's called less often, let's say; or, if it's called the same number of times, it could be useful.

14:29.000 --> 14:31.000
Thank you.

14:31.000 --> 14:44.000
By the way, I love that bee flying from the beehive to the flower.

14:44.000 --> 14:46.000
Thank you.

14:46.000 --> 14:50.000
I was wondering, like, instead of the kfunc,

14:50.000 --> 14:54.000
did you also consider implementing that instruction natively,

14:54.000 --> 14:59.000
as a one-to-one mapping in BPF, by extending the BPF instruction set?

15:00.000 --> 15:05.000
Because, I mean, that would still give you even better performance,

15:05.000 --> 15:11.000
I would expect. And are you planning to submit some of that upstream?

15:11.000 --> 15:15.000
So, there is the idea of doing that.

15:15.000 --> 15:26.000
Maybe doing it like the ktime-get function that's a BPF helper,

15:26.000 --> 15:30.000
so we can call this function and get the results

15:30.000 --> 15:34.000
without all the infrastructure around it.

15:34.000 --> 15:37.000
But of course the infrastructure is still a bit needed,

15:37.000 --> 15:42.000
so it becomes harder to do something like that.

15:42.000 --> 15:46.000
Have you been measuring the sampling overhead

15:46.000 --> 15:49.000
of the sampling mechanism?

15:49.000 --> 15:52.000
That is, the sampling mechanism itself:

15:52.000 --> 15:54.000
what's the overhead of it?

15:55.000 --> 15:58.000
It's pretty low.

15:58.000 --> 16:02.000
The overhead, let's see, this one,

16:02.000 --> 16:05.000
depends on the sampling rate you are using.

16:05.000 --> 16:08.000
The black line is the baseline,

16:08.000 --> 16:13.000
so anything below it is our result,

16:13.000 --> 16:16.000
and this is Inspect without sampling,

16:16.000 --> 16:19.000
so we gain this much if we sample every...

16:19.000 --> 16:20.000
This is to be found.

16:20.000 --> 16:23.000
Without sampling, or is it with sampling one-to-one?

16:23.000 --> 16:25.000
This is without sampling,

16:25.000 --> 16:28.000
and this is with sampling every eight packets.

16:28.000 --> 16:32.000
Yes, but I want to know the performance if I add sampling,

16:32.000 --> 16:35.000
but I sample every trace,

16:35.000 --> 16:40.000
so then I know what's the overhead of the sampling mechanism itself.

16:40.000 --> 16:45.000
Ah, okay, no, so we didn't do this kind of test.

16:45.000 --> 16:46.000
Thank you.

16:54.000 --> 16:56.000
Hey, thank you for the talk.

16:56.000 --> 17:00.000
I was wondering if you know what the extra work is

17:00.000 --> 17:02.000
that bpf_perf_event_read_value is doing

17:02.000 --> 17:05.000
beyond just calling the native instruction.

17:05.000 --> 17:07.000
So, it's mainly pretty heavy

17:07.000 --> 17:12.000
because it uses file descriptors to access these values,

17:12.000 --> 17:15.000
as a read of a file, let's say,

17:15.000 --> 17:20.000
and that's the main overhead that happens in this function.

17:20.000 --> 17:21.000
Thank you.

17:24.000 --> 17:27.000
Any more questions?

17:27.000 --> 17:31.000
I had a question.

17:31.000 --> 17:35.000
Are there any of these changes that could be brought to perf

17:35.000 --> 17:39.000
or to BPF to make them faster, based on the work you've done?

17:39.000 --> 17:41.000
Yes and no.

17:41.000 --> 17:46.000
The "no" part is because, since we are calling a kfunc,

17:46.000 --> 17:50.000
it's specific to our machine,

17:51.000 --> 17:54.000
and it would be hard to ask someone else

17:54.000 --> 17:57.000
to include a kfunc inside the kernel

17:57.000 --> 18:01.000
and call a kfunc not knowing what it actually does.

18:01.000 --> 18:03.000
So that would be the problem.

18:03.000 --> 18:09.000
There could be a way of doing it without calling a kfunc,

18:09.000 --> 18:15.000
but by simply calling the perf read-value function

18:15.000 --> 18:19.000
inside the BPF program, instead of calling the kfunc.

18:19.000 --> 18:24.000
So you can remove the fentry

18:24.000 --> 18:27.000
and the fexit overhead.

18:27.000 --> 18:30.000
But it's still pretty heavy to call that function,

18:30.000 --> 18:34.000
so the gains would be marginal, let's say.

18:34.000 --> 18:35.000
All right, thank you.

18:35.000 --> 18:38.000
Have you run your tool in production?

18:38.000 --> 18:43.000
Have you used it so far mostly for experimenting?

18:43.000 --> 18:46.000
Or have you tried actually running it in production?

18:46.000 --> 18:50.000
No, it was just experiments like this.

18:50.000 --> 18:53.000
No actual production testing.

18:53.000 --> 18:56.000
It would be interesting to get some benchmarks on that.

18:56.000 --> 18:58.000
Okay, thank you.

18:58.000 --> 19:01.000
Someone there has a question.

19:01.000 --> 19:03.000
All right, thank you.

19:03.000 --> 19:04.000
Thank you.

19:04.000 --> 19:07.000
Thank you.

