WEBVTT

00:00.000 --> 00:19.120
All right, hello everyone, hello, hello, my name is Jakub, I work at Cloudflare, thanks for

00:19.120 --> 00:24.000
joining me on this talk about attaching metadata to packets.

00:24.000 --> 00:29.240
Here's our agenda for today, we're going to first start off with what's our motivation

00:29.240 --> 00:35.000
behind this project, and then I'm going to take you through what's not quite a

00:35.000 --> 00:42.600
saga yet, but a trilogy: the three attempts we had so far to get this upstream.

00:42.600 --> 00:48.040
First, a quick announcement: we're looking for a lot of interns this year, so

00:49.080 --> 00:54.680
tell your friends, look up the blog post that tells you all the details about the program.

00:54.760 --> 01:03.400
Right, so what are we trying to do? First of all, how many of you know what an SKB or a

01:03.400 --> 01:11.720
socket buffer in the Linux network stack is? Okay, okay, that's not bad. All right, so our goal

01:11.720 --> 01:17.960
is to allow users, Linux network stack users, and their BPF programs to be able to attach

01:18.920 --> 01:23.960
hundreds of bytes of metadata to their packets as they travel through the network stack to

01:23.960 --> 01:30.200
do different cool projects. Whereas today we're limited to just, you know, a few bytes.

01:31.640 --> 01:37.960
And this work is driven by a few very concrete use cases we have in production.

01:37.960 --> 01:46.440
This first one being that on our receive path, we need to reclassify the packets at every stage,

01:47.320 --> 01:53.400
and the configuration propagation time for each of these stages is a little bit different.

01:53.400 --> 01:59.400
So there's some transient inconsistency that brings us operational pain.

02:00.760 --> 02:07.160
So ideally we would like to classify the packet only once, early on, perhaps in our XDP program,

02:07.880 --> 02:13.400
tag it with a service identifier, and then all the later stages of the network stack

02:13.480 --> 02:19.240
would just look up the service identifier and say, oh, I need to run rules for service X, right?

02:21.080 --> 02:26.040
Our second use case comes from the fact that we have many, many data centers,

02:26.440 --> 02:31.960
and we do traffic engineering between them sometimes. So if one of the data centers

02:31.960 --> 02:38.440
is getting overloaded, we're going to take part of the traffic and encapsulate it and forward it to

02:38.440 --> 02:46.200
another location to relieve the overloaded colo. The problem that stems from that is that when

02:46.200 --> 02:53.000
we log through which colo, through which data center a request came through,

02:53.000 --> 02:57.640
we get them misattributed, right? And that messes up our capacity planning.

02:58.920 --> 03:05.160
We could easily solve that: because we know from which data center the traffic was forwarded,

03:05.960 --> 03:11.880
we could extract that information, attach it to an incoming request, and then if it could

03:11.880 --> 03:17.160
travel together with a packet up to the application that actually processes the request,

03:17.160 --> 03:21.400
the application could read that information out and log it, and we would be golden.

03:23.960 --> 03:31.160
Our next use case that we have in mind is just making our service's life easier.

03:31.160 --> 03:38.440
Cloudflare, in a great simplification, is a proxy that sits between users, which we call eyeballs,

03:38.440 --> 03:46.360
these would be your web browsers, your apps calling APIs, and we just forward the traffic to our

03:46.360 --> 03:52.200
customers' servers, which we call origins. Well, sometimes we inadvertently drop packets,

03:52.920 --> 03:57.000
and it would be really helpful for us to know whose packets we are actually dropping.

03:57.880 --> 04:05.320
So in an ideal world, our proxies, when they are forwarding messages, could label them

04:05.320 --> 04:13.000
with a customer identifier, put them on the transmit path of the network stack, and then if the

04:13.000 --> 04:19.800
packet happens to get dropped, we could have a BPF program attached to the function that frees a

04:19.880 --> 04:24.920
packet that would just read out the customer identifier and count it, log it.

04:26.680 --> 04:33.480
And our last use case, probably the most complex one, is keeping track of how requests are

04:33.480 --> 04:40.680
traveling through the network stack within one single host. So inside our machines we will

04:40.680 --> 04:48.920
actually usually have a chain of proxies, and packets might get encapsulated, read out to user space,

04:49.080 --> 04:55.160
over a Unix socket connection, and it will loop back across network namespaces.

04:55.160 --> 05:02.200
It's really hard to keep track of how a request is traveling. If we could tag it with some kind of

05:02.200 --> 05:10.920
trace identifier that would survive this journey, then it would make debugging problems much easier.

05:10.920 --> 05:21.720
All right, so how do we go about it? How do we attach metadata to packets? Well, the first idea

05:21.720 --> 05:27.800
we had was called SKB Traits. But before I can talk about SKB Traits, we need to talk a little bit

05:27.800 --> 05:35.480
about the layout of the packet buffer. So when you have XDP programs attached to your NIC device,

05:36.200 --> 05:43.000
the driver will allocate a buffer for the packet data, but it will leave a bit of

05:43.000 --> 05:50.280
headroom in front of the packet data for various stuff. It will usually be 256 bytes.

05:51.880 --> 05:57.720
Now this packet buffer, later on when we allocate the SKB, will actually become the SKB head buffer.

05:58.680 --> 06:09.720
But before it becomes that, XDP will put its stuff in front, and your XDP programs might

06:09.720 --> 06:18.920
write some custom metadata in front of the packet data. The problem with that existing

06:18.920 --> 06:26.360
metadata feature is that you can't access it past the TC BPF hook. And in addition to that,

06:26.920 --> 06:33.160
it will just get clobbered as you push headers onto your packet.

06:33.720 --> 06:41.240
So our first idea was, hey, let's fix that by introducing a brand new area for metadata.

06:41.240 --> 06:47.400
We call it traits. It would be accessible from all the stages of the network stack.

06:48.040 --> 06:55.080
And it would be at the far end of the headroom. So it wouldn't be in the way of pushing new

06:55.080 --> 07:01.960
encapsulation headers. But the feedback we got for that from upstream was that we don't need

07:01.960 --> 07:10.120
a second metadata area. We already have one. It would be a maintenance burden. So fair enough, but of course,

07:10.120 --> 07:18.520
sadness. But we worked with that, and since we were told that we already have a metadata area,

07:19.240 --> 07:27.080
it only made sense to look into whether we can make it work for us. So how do we make the XDP

07:27.080 --> 07:35.640
metadata work past XDP and not break any existing users? Well, to talk about that, first

07:35.640 --> 07:42.760
we need to learn a little about how your BPF programs, TC BPF programs, access this metadata.

07:43.960 --> 07:49.640
They actually do it by accessing it through the data_meta pointer,

07:50.760 --> 07:56.280
which is not a real pointer that exists in your SKB. It's actually just a pseudo pointer,

07:56.280 --> 08:04.360
that we compute for the benefit of your BPF programs. And it's relative to the SKB data

08:04.360 --> 08:10.680
pointer, which points at the beginning of the packet payload. So what's the problem?

08:10.680 --> 08:16.600
Well, as your SKB travels through the network stack layers and we move the SKB data pointer,

08:17.320 --> 08:23.320
your data_meta pointer starts pointing at the wrong place, right?

08:24.840 --> 08:32.040
So to make this scheme work, we would need to keep copying the bytes over every time we

08:32.040 --> 08:41.400
move the SKB data pointer, which is not ideal. The solution: well, let's just hide away the fact of

08:43.240 --> 08:49.480
how we access the metadata by introducing an abstraction. And in this case, a useful

08:49.480 --> 08:57.480
abstraction that we applied was the BPF dynptr, which you can think of as a slice,

08:57.800 --> 09:07.560
which underneath has a pointer and a length. In our case, it was a pointer that was relative

09:07.560 --> 09:14.280
to the SKB head, or actually relative to the MAC header. Why relative to the MAC header? Well,

09:14.280 --> 09:19.800
that's simply because the Linux network stack already tracks the position

09:19.800 --> 09:22.600
of the metadata like that. So we didn't have to do anything.

09:22.680 --> 09:31.560
That got in. We made some progress. Then we had to fix some BPF helpers. I'm not going to

09:31.560 --> 09:41.320
talk about that, but we made some more progress. And then we encountered another issue. Well,

09:41.320 --> 09:50.120
if you have layer 2 tunnels in your network setup, guess what? They're going to change the MAC

09:50.120 --> 09:56.440
header offset as they encapsulate the packets. So now your slice to your metadata, again,

09:56.440 --> 10:03.640
points to the wrong place. So what's the solution? Well, we can keep track of where the metadata

10:03.640 --> 10:10.680
is located separately from the MAC header. We need to store one more offset for that in the SKB.

10:12.520 --> 10:19.240
And that's quite neat, actually, because this allows us to relocate the metadata

10:19.880 --> 10:24.760
anywhere we want within the headroom. We could push it to the top of the headroom, which looks a lot

10:24.760 --> 10:34.520
like traits, right? It's not ideal, because the data_meta pointer is not going away.

10:35.080 --> 10:41.480
It's part of the BPF API, and we still have to support it. So for some programs, we'll need to keep

10:41.480 --> 10:47.240
copying the metadata. But overall, we can make this work. So this is what we proposed: that

10:47.320 --> 10:54.680
metadata could just live anywhere in the headroom. But the feedback we got was: this feels like

10:54.680 --> 11:00.920
duct tape, this is a one-off solution, please use SKB extensions instead. All right,

11:02.280 --> 11:12.600
sadness again. I guess it's not the XDP metadata then, but the SKB extension. So what is an SKB extension?

11:13.480 --> 11:20.760
So adding new fields to your SKB socket buffer is a big no-no, because of performance reasons.

11:20.760 --> 11:28.760
So smart kernel engineers added a pointer at the end that can point to an additional buffer

11:29.320 --> 11:38.360
where various features can store their stuff, like MPTCP or crypto offloads.

11:38.760 --> 11:46.760
So we could just store our stuff in an SKB extension, but probably not the metadata directly.

11:50.440 --> 11:57.080
We don't want to embed it in an SKB extension, because the buffer for SKB extensions gets allocated

11:57.080 --> 12:04.920
as soon as any one extension gets activated, right? So if you're using MPTCP, but you're not using

12:05.880 --> 12:12.680
metadata, then you would have just wasted an allocation for that memory, right?

12:15.240 --> 12:21.640
We could allocate space for the metadata separately and just point to it from an SKB extension.

12:22.280 --> 12:30.200
But this feels just like another metadata area, which we already tried with traits, and

12:31.160 --> 12:36.840
it feels like we would be repeating the same mistake. So let's go for something more

12:36.840 --> 12:44.200
general, and here's where BPF local storage comes in. So BPF local storage is an established concept

12:44.200 --> 12:52.200
in the BPF ecosystem and the Linux network stack, which allows BPF programs to attach

12:52.920 --> 12:59.480
arbitrary data to the common entities that we have, like cgroups, tasks, inodes,

12:59.480 --> 13:06.360
sockets. And each of these objects has a pointer to BPF local storage, named one way or the

13:06.360 --> 13:15.320
other, that will actually point to a list, not a byte array, which will have a list of storage

13:15.400 --> 13:22.520
elements, and that's because you can have multiple users attaching their data to tasks,

13:22.520 --> 13:30.280
cgroups, and you want to accommodate them all. And the way your BPF programs access the storage

13:30.280 --> 13:38.280
is through a BPF map abstraction, which identifies your storage and defines your storage geometry.

13:39.160 --> 13:48.520
So you can have, like, two users of the storage and two corresponding maps, right? So let's have a pointer

13:48.520 --> 13:57.880
to BPF local storage in our SKB extension, and your programs then can access it through a new

13:57.880 --> 14:07.400
map type called SKB storage. And that's all neat, that meets our requirements, but to actually create

14:07.480 --> 14:13.160
such storage, you have three different allocations that you need to do. And because of that,

14:13.160 --> 14:19.240
we expect it won't work out of the box for all the use cases we have, especially those that

14:19.240 --> 14:26.920
involve tagging or labeling every packet that passes through the receive or transmit path. So expect

14:26.920 --> 14:32.760
like a long tail of optimizations before we can actually make this work for us.

14:32.760 --> 14:38.920
Nevertheless, I've tried it out just this week, and the prototype seems to be

14:38.920 --> 14:45.080
working, passing some tests that implement some of these use cases I've talked about. So please

14:46.040 --> 14:54.360
watch out for the patches appearing on the netdev and BPF mailing lists. And we'll see,

14:54.360 --> 15:01.080
is this finally when we get rich packet metadata? I guess only time will tell.

15:03.720 --> 15:09.880
All right, I'm leaving you with links to everything I referred to here. If you have any questions,

15:09.880 --> 15:15.800
feel free to shoot me an email. And once again, I'll remind you that we're looking for

15:15.880 --> 15:40.600
interns. Thank you. Any questions? Okay, one over there. Thanks for your talk, just a

15:40.680 --> 15:47.320
remark: keep an eye on the number of extensions, because we from the CAN subsystem are about to

15:47.320 --> 15:56.360
add one soon as well. And I think there was a limit of eight or something. I'm not sure how much

15:56.360 --> 16:05.640
is used. So the nice thing, and something that also needs improvement, about extensions is that you

16:05.640 --> 16:12.600
can select which ones you include at build time, right? Yeah, but the Kconfigs are not very

16:12.600 --> 16:19.960
granular today. Like, say you're not interested in using the crypto offloads because you don't

16:19.960 --> 16:25.320
have NIC hardware that supports it, or you just don't use it. Yeah, but then all of

16:25.320 --> 16:31.960
these Kconfigs should work. And it should not break because nine bits do not fit in a u8.

16:32.840 --> 16:40.200
So yeah, that's true. I mean, maybe it will come to that, but you know, you have to pick and choose

16:40.200 --> 16:46.360
with which extensions you build your kernel, or maybe we just add more bits. I think there's still

16:47.960 --> 16:55.640
a hole in the extension header, before the chunks. Just a heads up.

16:56.200 --> 17:03.960
Thanks. We were discussing: if some extensions are mutually exclusive, they could just use the same slot.

17:05.400 --> 17:06.920
Makes sense. Thanks.

17:14.120 --> 17:22.040
How about cache locality, if you're chasing the two pointers? I guess it's pretty painful for

17:22.120 --> 17:29.240
caches. Yeah, I would imagine so as well. I don't have any benchmarks as far as

17:30.200 --> 17:36.200
what the overhead is. We just expect it to be painful, but we'll have some benchmarks for the RFC.

17:37.880 --> 17:45.800
But you know, we've tried doing it without allocations, reusing the free space in the packet

17:45.800 --> 18:05.800
buffer. We couldn't make it work. Any other questions? All right, thank you. Thank you.

