WEBVTT

00:00.000 --> 00:17.000
So, we get started again. The next talk is about extending AF_XDP for

00:17.000 --> 00:25.800
fast colocated packet transfer. So, thank you. I am going to talk about AF_XDP and some

00:25.800 --> 00:32.800
extensions that I did in my PhD work. This has been a collaboration with many of the students

00:32.800 --> 00:39.720
who are working with my professors. So, let me just get started. How many of

00:39.720 --> 00:45.720
you have used XDP? Okay, everyone. You know what it does, right: fast packet processing,

00:45.720 --> 00:51.080
implement your own stuff, and it runs in the driver very early. For example, if the IP is something,

00:51.080 --> 00:56.360
drop it; that is how it works, right. But one thing that you might have experienced is that there

00:56.360 --> 00:59.640
are many constraints. You cannot do everything. If you say, okay, I am going to hold

00:59.640 --> 01:05.240
on to this packet and send it out later, you cannot do that. There is

01:05.240 --> 01:11.360
no such thing, right. So, one other alternative that we have is AF_XDP. Here, what

01:11.360 --> 01:19.080
we do is, well, let me ask again: who has used AF_XDP? Okay, fine. So, here, what you do is, instead

01:19.080 --> 01:24.640
of doing the processing in the XDP layer, you run an XDP program which says, okay, instead

01:24.640 --> 01:28.040
of processing here, I will send the packet to some other application that is running

01:28.040 --> 01:33.400
in user space. And what it can do is deliver the packet in a zero-copy

01:33.400 --> 01:38.200
manner. So, basically, the NIC can directly DMA the packet

01:38.200 --> 01:43.400
into user-space memory, and it is really awesome. You can get really good performance

01:43.640 --> 01:50.280
out of it. Now, what we would like to do is have multiple

01:50.280 --> 01:55.480
such zero-copy applications running on the same machine, because

01:55.480 --> 02:00.120
of efficiency reasons, or because you want a strategy where you are breaking an

02:00.120 --> 02:04.600
application into smaller parts and then joining them together to provide some

02:04.600 --> 02:10.360
service, right. This is the goal. But the problem is that,

02:10.440 --> 02:15.240
if you want to transfer the packet between them, the packet goes out through the

02:15.240 --> 02:20.120
NIC. You cannot say, okay, I want to send a packet from application one to application

02:20.120 --> 02:25.080
two; it will just go out, right. How do you solve that? So, what are the existing solutions? You

02:25.080 --> 02:29.800
might say, okay, SR-IOV: I am going to use the hardware support that is there in the machine,

02:29.800 --> 02:33.800
there in the NIC, and then I am going to use that to redirect the packet from one application

02:33.800 --> 02:40.040
to the other application. But as it turns out, AF_XDP support on such drivers is very

02:40.680 --> 02:46.840
scarce. For example, virtual function support is only available for Mellanox mlx5, and

02:46.840 --> 02:52.600
sub-function support only for ice and mlx5, right. And also, if you try to implement this with SR-IOV,

02:52.600 --> 02:57.160
there will be packet copies from one application to the other applications,

02:57.160 --> 03:01.720
and the throughput will decrease because of those copies. Because ideally,

03:01.720 --> 03:05.960
if you have multiple such applications running on the same machine, if you

03:05.960 --> 03:09.800
can somehow share memory between these applications, you can just pass pointers from

03:09.800 --> 03:14.760
one application to the other, right. So, another thing that you could

03:14.760 --> 03:19.080
think of as a possible solution: there are works in DPDK,

03:19.080 --> 03:23.960
like OpenNetVM, which try to solve this problem by having a user-space application,

03:24.680 --> 03:30.760
a manager of sorts, which orchestrates this packet transfer from one application to the other.

03:31.720 --> 03:37.080
But the problem is that now there is very tight coupling between such libraries as

03:37.080 --> 03:41.640
well as the applications that people are going to build, right. And also, you have to port all

03:41.640 --> 03:46.360
your AF_XDP applications, whatever applications you are making, to new APIs, right.

03:47.080 --> 03:52.120
So, what we have done in our work is implement packet

03:52.120 --> 03:57.240
redirection from one AF_XDP application to another AF_XDP application inside the kernel itself.

03:58.200 --> 04:03.320
Therefore, if, for example, app one wants to send a packet to another app,

04:03.320 --> 04:07.800
the only thing that you do is just send normally, nothing else; internally, the packet will

04:07.800 --> 04:14.200
be sent to the other application. And we named it FLASH, fast linked AF_XDP sockets for

04:14.200 --> 04:21.800
high-performance chaining. Now, let me just give a brief idea of what this AF_XDP thing is

04:21.800 --> 04:28.360
and how things work, right. So, what is happening here is that the application has some

04:28.360 --> 04:34.280
user-space memory that has been allocated, and this allocated memory, the UMEM, is shared

04:34.840 --> 04:39.720
with the driver as well as the AF_XDP socket. So, this is where the DMA and other things happen.

04:39.720 --> 04:46.280
And now, to pass the control of where to DMA, or when the driver wants to say, hey, I have DMA'd

04:46.280 --> 04:49.880
the packet here, please look there, there are multiple rings, single-producer

04:49.960 --> 04:54.200
single-consumer rings. For example, if you look at the purple lines, these are used for receiving

04:54.200 --> 04:58.520
packets, and the orange lines that are present are used for transmitting packets.
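To make the ring layout concrete, here is a tiny Python model of the four descriptor rings and the shared UMEM (purely illustrative; the class and method names are invented, not the real AF_XDP API):

```python
from collections import deque

# Conceptual model: the application shares a UMEM (an array of packet
# buffers) with the driver, plus four rings that carry buffer indices
# (descriptors), not packet data itself.
class XskModel:
    def __init__(self, num_frames):
        self.umem = [bytearray(2048) for _ in range(num_frames)]
        self.fill = deque(range(num_frames))  # app -> driver: buffers to receive into
        self.rx = deque()                     # driver -> app: received packets
        self.tx = deque()                     # app -> driver: packets to send
        self.completion = deque()             # driver -> app: sent, buffer reusable

    def driver_receive(self, payload):
        """Driver DMAs a packet into a buffer taken from the fill ring."""
        idx = self.fill.popleft()
        self.umem[idx][:len(payload)] = payload
        self.rx.append(idx)

    def driver_transmit(self):
        """Driver sends out everything queued on the TX ring."""
        while self.tx:
            idx = self.tx.popleft()
            self.completion.append(idx)  # buffer is now free for reuse

xsk = XskModel(num_frames=4)
xsk.driver_receive(b"hello")
idx = xsk.rx.popleft()      # the app consumes the packet...
xsk.tx.append(idx)          # ...and queues the same buffer for transmission
xsk.driver_transmit()
print(len(xsk.completion))  # -> 1 buffer handed back for reuse
```

Only descriptor indices move through the rings; the packet bytes stay in the UMEM, which is what makes the zero-copy path possible.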

04:59.080 --> 05:05.160
And if you look at it, the transfer of packets is now split between user space and the kernel space

05:05.160 --> 05:11.880
where receiving and transmission are actually happening in user space, and the processing or sending

05:11.880 --> 05:19.000
the packets out to the NIC is being handled by the driver itself, right. And if we again go

05:19.080 --> 05:24.200
much deeper and look at how the reception happens: the first thing that the user-space

05:24.200 --> 05:28.200
application does is tell the driver, hey, these are the locations where you can

05:28.200 --> 05:33.400
DMA the packets, by putting them in the fill ring. And now, the driver says to the NIC,

05:33.400 --> 05:38.520
you can do that. And whenever a DMA happens, the XDP program runs; it says,

05:38.520 --> 05:44.760
okay, and looks up which AF_XDP socket it should send the packet to now. It does that, and it

05:44.840 --> 05:50.840
puts the descriptor of that particular location into the RX ring, which the AFXDP socket uses to

05:50.840 --> 05:56.680
get to know, okay, here the packet has been received, right. Similarly, for transmission also,

05:56.680 --> 06:02.920
the TX and completion rings are used, right. Interestingly, there are two threads

06:02.920 --> 06:08.360
working here, right: the driver thread and the AF_XDP user-space thread.

06:08.360 --> 06:13.400
The driver thread is actually interrupt-driven; whenever a packet comes, or it tries to send a

06:13.480 --> 06:19.240
packet out through the NIC, it is scheduled, whereas the AF_XDP socket is under your control,

06:19.240 --> 06:24.360
however you want to run it. Now, there is another kind of mode that

06:24.360 --> 06:29.720
you can actually use, where the AF_XDP application itself can initiate the sending and reception of the

06:29.720 --> 06:35.320
packets by using the sendto and recvfrom system calls. So, it kind of behaves like busy polling

06:35.320 --> 06:41.160
that will happen, right. So, now, the idea that we want to implement is this. There are

06:41.160 --> 06:46.280
two applications; whenever application one tries to send a packet to the next application,

06:46.280 --> 06:50.920
we just want to take the descriptor from the TX ring of one and put it in the RX ring of the other

06:50.920 --> 06:57.000
application. But there are many challenges. The first of them is the rings, right.

06:57.000 --> 07:02.200
The rings of AF_XDP are single-producer single-consumer. So, if you just try to put the packet there,

07:02.200 --> 07:06.920
it will violate the semantics, right. So, first of all, what we tried to do was

07:07.240 --> 07:13.880
keep the single-producer single-consumer rings by having a single kthread process

07:13.880 --> 07:18.200
all of them, to serialize them. We found out that the speeds that we were getting were

07:18.200 --> 07:22.680
very low when we increased the number of sockets. So, instead, what we did was

07:22.680 --> 07:27.480
update the rings to have multi-producer, multi-consumer support.

07:28.280 --> 07:32.680
While doing so, we ensured that, from user space, they remain single-producer single-consumer. So,

07:32.680 --> 07:37.160
what we did, if we look at the code here: earlier, there were producer, consumer, and flags.

07:37.160 --> 07:44.680
We have added two more values here, producer head and consumer head. Now, these two are used

07:44.680 --> 07:50.680
to serialize multiple producers or multiple consumers. And while doing so, we internally used

07:50.680 --> 07:55.800
compare-and-swap operations. And we have abstracted all of this implementation behind some

07:55.800 --> 08:01.480
simple APIs inside the AF_XDP subsystem itself. So, for enqueuing, you give the descriptor,

08:01.480 --> 08:07.720
it will enqueue it; for dequeuing also, it will do that. So, now, if you have this

08:07.720 --> 08:12.680
system, right, now you can have multiple producers and multiple consumers. Now, what you can do

08:12.680 --> 08:17.880
is, if you look at the figure, when you try to transmit, you can check where to send

08:17.880 --> 08:23.400
next, and you can directly put that buffer into the RX ring of the next socket.
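The reserve-then-publish scheme behind those producer-head and consumer-head fields can be sketched in Python (illustrative only; the kernel would use cmpxchg, modeled here with a lock, and all names are invented):

```python
import threading

class MpmcRing:
    """Bounded ring with a two-phase producer: first reserve slots by
    advancing producer_head, then publish by advancing producer. The
    extra head cursor is what serializes multiple producers, while a
    consumer (or user space) still only sees the classic monotonically
    advancing producer/consumer pair."""
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.producer = 0        # published entries, visible to consumers
        self.producer_head = 0   # reserved entries, internal only
        self.consumer = 0
        self._lock = threading.Lock()  # stand-in for an atomic cmpxchg

    def reserve(self, n):
        """Atomically claim n slots, or return None if there is no room."""
        with self._lock:
            if self.producer_head + n - self.consumer > self.size:
                return None
            start = self.producer_head
            self.producer_head += n
            return start

    def publish(self, start, descs):
        """Copy descriptors into the claimed slots, then publish in order."""
        for i, d in enumerate(descs):
            self.slots[(start + i) % self.size] = d
        while self.producer != start:   # wait for earlier reservations
            pass
        self.producer = start + len(descs)

    def pop(self):
        if self.consumer == self.producer:
            return None
        d = self.slots[self.consumer % self.size]
        self.consumer += 1
        return d

ring = MpmcRing(8)
start = ring.reserve(2)
ring.publish(start, ["desc0", "desc1"])
print(ring.pop(), ring.pop())  # -> desc0 desc1
```

Because only `producer` and `consumer` are ever read from outside, an unmodified single-threaded user-space application keeps working against the same ring layout.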

08:24.120 --> 08:29.240
But, by doing so, what is going to happen is that application 2 will have one more buffer,

08:29.320 --> 08:34.680
and application 1 is going to have one less buffer. So, there is some

08:34.680 --> 08:40.120
imbalance of buffers between the applications. Now, to fix that, what we do is pull

08:40.120 --> 08:44.040
one buffer from the fill ring and put it into the completion ring. So, what is going to happen?

08:44.040 --> 08:48.600
From the perspective of application 1, it will feel as if it has successfully sent

08:48.600 --> 08:54.280
one buffer. And from the perspective of application 2, it will feel as if it has received a new

08:55.080 --> 08:58.840
buffer from the NIC, whereas internally, in the same machine,

08:58.840 --> 09:04.120
the buffer has just gone from application 1 to application 2. And also, this is when you

09:04.120 --> 09:08.600
have a single UMEM region, right, where you can just pass the pointer from one

09:08.600 --> 09:13.320
application to the other. If you have two such applications with different UMEM regions,

09:13.320 --> 09:17.560
what it will do is just a memcpy internally, right, instead of

09:17.560 --> 09:21.960
pointer passing. So, that support is also implemented in our system.
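The descriptor hand-off and buffer rebalancing described above can be sketched like this (an illustrative Python model with invented names; only the same-UMEM zero-copy case is shown, a different-UMEM pair would memcpy the payload instead):

```python
from collections import deque

class App:
    """One AF_XDP application, modeled only by its four rings."""
    def __init__(self, name, frames):
        self.name = name
        self.fill = deque(frames)   # free buffers offered for reception
        self.rx = deque()
        self.tx = deque()
        self.completion = deque()

    def owned(self):
        # Total buffers currently accounted to this application.
        return (len(self.fill) + len(self.rx)
                + len(self.tx) + len(self.completion))

def redirect(src, dst):
    """Move one descriptor from src's TX ring to dst's RX ring, then
    rebalance: pull one free buffer from dst's fill ring and hand it to
    src's completion ring, so both apps keep the same buffer count."""
    idx = src.tx.popleft()
    dst.rx.append(idx)            # dst "received" a packet
    spare = dst.fill.popleft()    # dst gives up one free buffer...
    src.completion.append(spare)  # ...which looks to src like a completed send

app1 = App("app1", frames=[0, 1, 2, 3])
app2 = App("app2", frames=[4, 5, 6, 7])
app1.tx.append(app1.fill.popleft())  # app1 queues buffer 0 for transmit
redirect(app1, app2)
print(app1.owned(), app2.owned())    # -> 4 4 (counts stay balanced)
```

Without the fill-to-completion rebalancing step, every redirected packet would drain one buffer from the sender and accumulate it at the receiver.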

09:22.920 --> 09:28.600
Now, while doing so, we realized that we were using compare-and-swap operations,

09:28.600 --> 09:33.560
and it was happening very frequently. The performance was decreasing; compared to the

09:33.560 --> 09:39.000
baseline implementation, the performance was well below it. So, what we tried to do is,

09:39.000 --> 09:44.200
hold off on whatever packets we are getting; we do not enqueue them immediately. We hold them off

09:44.200 --> 09:49.880
for some batch size, which is, I think, the NAPI poll budget size itself.

09:49.880 --> 09:56.120
And then, in a single turn, we push all these buffers into the rings.

09:56.680 --> 10:01.000
And for that, also, we have introduced a new API. Now, the problem is that,

10:01.560 --> 10:07.320
okay, I have 30 buffers that I want to push into the ring. And you do not know, right,

10:07.320 --> 10:11.320
how much space the ring will have in the future. And because we have updated these rings

10:11.320 --> 10:16.840
from single-producer, single-consumer to support multiple producers... For example, I have looked:

10:16.840 --> 10:22.280
okay, in the RX ring, I have room for two buffers. If you take that knowledge and try to

10:22.840 --> 10:27.000
put two buffers into the RX ring, it might happen that some other producer has already

10:27.000 --> 10:33.400
filled this RX ring. So, that problem can occur. To handle that, what we have done

10:33.400 --> 10:39.880
is order the ring updates in a manner that ensures that the TX

10:39.880 --> 10:44.440
packets that we are sending will always be guaranteed to land. The thing is,

10:44.840 --> 10:51.960
there is already a TX batching option in AF_XDP; for implementing that, they have already

10:51.960 --> 10:56.840
faced some problems like this, but only for a single application.

10:56.840 --> 11:01.640
Now, since we have two such applications, it was much more difficult to do that. We have taken that

11:01.640 --> 11:07.960
idea and made sure that the order in which we are sending the packets ensures that there are no packet

11:07.960 --> 11:13.320
drops. And while doing so, we found out that we might have pulled some buffers from the

11:14.200 --> 11:19.240
fill ring that we cannot enqueue, and they might be lying around in our hands, because,

11:19.240 --> 11:24.760
for example, if we pull three buffers and only two are needed, what do we do with that extra

11:24.760 --> 11:28.760
fill buffer that we have got? So, what we do is hold it off for the next

11:28.760 --> 11:34.040
iteration and then reuse it. And we have some checks in place,

11:34.040 --> 11:39.400
and we have tested that this ensures no packet drops or other such things happen.
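The batching-with-carry-over logic can be sketched as follows (illustrative Python; only the batch size, the NAPI poll budget, comes from the talk, everything else is invented):

```python
BATCH = 64  # the NAPI poll budget, used as the batch size in the talk

class BatchedFlush:
    """Accumulate descriptors and push them with one ring update per
    batch, so the compare-and-swap cost is paid once per BATCH packets
    instead of once per packet."""
    def __init__(self):
        self.pending = []

    def add(self, desc):
        self.pending.append(desc)
        return len(self.pending) >= BATCH  # caller should flush now

    def flush(self, ring_space):
        """Push at most ring_space descriptors, since another producer
        may have filled the ring in the meantime; whatever does not fit
        is held over for the next iteration rather than dropped."""
        n = min(len(self.pending), ring_space)
        out, self.pending = self.pending[:n], self.pending[n:]
        return out

b = BatchedFlush()
for i in range(70):
    b.add(i)
sent = b.flush(ring_space=64)
print(len(sent), len(b.pending))  # -> 64 6
```

The key property is that a shortfall in ring space never loses descriptors; they simply ride along to the next flush.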

11:39.400 --> 11:46.840
So, there is no corruption or anything like that. The third thing that we faced is this: there are

11:46.840 --> 11:52.440
many applications and libraries, like CNDP, which try to employ a hybrid polling mechanism

11:52.440 --> 11:57.560
to increase the efficiency of AF_XDP applications. So, ideally, what it does is, initially,

11:57.560 --> 12:04.040
it starts the application in an interrupt mode using the poll system call. So, your application

12:04.040 --> 12:08.840
will not take up CPU cycles when no packets are coming. The moment packets

12:09.560 --> 12:14.360
start coming, it transitions into busy-poll mode and starts processing packets. And again,

12:14.360 --> 12:19.400
if there are no packets, it reverts back to the interrupt mode. But now, the problem is that,

12:19.400 --> 12:24.040
if you look at the figure, the packets are only coming in application 1 and application 3,

12:24.040 --> 12:28.680
application 2 is doing nothing. There are no interrupts coming to application 2. And hence,

12:28.680 --> 12:32.200
the kernel will think that, okay, there is no processing happening, I will not schedule it.

12:32.680 --> 12:38.120
So, by default, if we implement whatever we want to do, it is going to be messed up.

12:39.160 --> 12:44.760
The rings will get empty, or some deadlock situation might happen. So, what we have to do is

12:44.760 --> 12:52.840
implement a hybrid polling mechanism which also schedules the

12:52.840 --> 12:58.280
application whenever packets are redirected from one application to other applications.

12:59.160 --> 13:04.840
Also, whenever there are applications which are sending packets from one NF to another NF,

13:04.840 --> 13:10.120
if the second NF is congested and is not able to process packets, the first

13:10.120 --> 13:15.880
application continuously tries to send the buffer again and again,

13:15.880 --> 13:21.560
wasting CPU cycles. So, what we also implemented is this: senders, if they see that,

13:21.560 --> 13:26.120
okay, my ring is full, I cannot send more packets, can just do a poll system call

13:26.200 --> 13:32.280
with POLLOUT, which will block the application. And in the receiver application,

13:32.280 --> 13:37.240
when this congestion goes away, it can just call the recvfrom system call to,

13:37.240 --> 13:41.640
internally, inside the kernel, wake up the other applications: I am ready,

13:41.640 --> 13:45.800
I now have space, you can try sending again. So, we did implement this also.
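That blocking handshake can be modeled with a condition variable (a Python sketch of the idea only; in the real system the sender blocks in poll() with POLLOUT and the receiver's recvfrom() performs the kernel-side wakeup):

```python
import threading

class Backpressure:
    """Sender blocks when the destination ring is full; the receiver
    wakes it once it has drained an entry, instead of the sender
    burning CPU retrying in a loop."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.ring = []
        self.cv = threading.Condition()

    def send(self, desc):
        with self.cv:
            while len(self.ring) >= self.capacity:  # ring full: block (POLLOUT-style)
                self.cv.wait()
            self.ring.append(desc)
            self.cv.notify_all()  # wake a waiting receiver

    def receive(self):
        with self.cv:
            while not self.ring:
                self.cv.wait()
            desc = self.ring.pop(0)
            self.cv.notify_all()  # wake any blocked sender: space is free now
            return desc

bp = Backpressure(capacity=2)
t = threading.Thread(target=lambda: [bp.send(i) for i in range(4)])
t.start()
got = [bp.receive() for _ in range(4)]
t.join()
print(got)  # -> [0, 1, 2, 3]
```

The sender parks until the receiver explicitly signals free space, which is exactly the CPU-cycle saving the talk describes for congested NF chains.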

13:47.240 --> 13:53.400
The other part of the implementation: it is not quite so easy to do this.

13:53.480 --> 13:58.600
If you have multiple applications running across different containers, these are different

13:58.600 --> 14:04.920
namespaces, you cannot directly access the queue. What you can do is have some

14:05.640 --> 14:11.080
application running in the host namespace, which handles passing the control of the

14:12.680 --> 14:17.960
NIC to the application that is running inside. So, what we did was implement a flash monitor,

14:18.040 --> 14:23.400
which communicates with the applications running inside containers

14:25.960 --> 14:31.480
using UDS sockets. And by doing that, you can create as well as share user

14:31.480 --> 14:36.520
memory between applications that are running in different containers

14:36.520 --> 14:44.840
very easily. And it ensures that privileges and all accesses are taken care of

14:44.840 --> 14:51.880
by this privileged application. And also, you might ask: if you have two different applications

14:51.880 --> 14:57.560
which are sharing memory, application one can corrupt the memory of the other application;

14:57.560 --> 15:01.800
what will we do about that? So, what we did was implement a Rust library. So,

15:01.800 --> 15:07.080
if the applications are using our Rust library, it ensures that, if application one has sent

15:07.080 --> 15:13.080
a packet to another application, then the control of that packet is gone; you cannot do anything

15:13.080 --> 15:19.560
with it anymore. This is the fourth thing that we did. But the point is that, okay, you have this

15:19.560 --> 15:25.160
system of passing packets from one AF_XDP socket to another AF_XDP socket. But how do you specify

15:25.160 --> 15:31.320
which socket is next, right? Because these rings and all these terms that I am talking

15:31.320 --> 15:36.520
about are structures inside the kernel, right, which are associated with some AF_XDP socket

15:36.520 --> 15:41.560
through this struct xdp_sock. And these are process-local, right? You

15:41.640 --> 15:48.920
have fds; an fd points to this xdp_sock, which then points to these rings, right? So, if, for

15:48.920 --> 15:56.440
example, AF_XDP socket 1 wants to send a packet to AF_XDP socket 2, how do you get access

15:56.440 --> 16:01.720
to those rings, right? So, to understand first how XDP does that, we looked at how

16:01.720 --> 16:08.360
XDP does that. What XDP does is, they have a map, the XSKMAP, which has some kind of index;

16:08.360 --> 16:15.080
the user can specify what the index is; by default, it is the queue ID on the interface

16:15.080 --> 16:21.080
that you are using. And it actually stores this struct xdp_sock. So, whenever a packet is coming

16:21.080 --> 16:26.680
to the device and to the queue, it looks up the map, gets this socket pointer, and then from there

16:26.680 --> 16:32.040
it gets access to the rings, where it enqueues packets, right? So, now, we want to do the same thing:

16:32.040 --> 16:37.000
whenever socket 1 wants to send a packet to socket 2, we want to get access. So, first we thought,

16:37.320 --> 16:43.320
okay, maybe we can have some XDP program running in the egress path, but before that we thought,

16:43.320 --> 16:47.800
okay, let us go and implement a straightforward solution. So, what we did was implement a

16:47.800 --> 16:54.280
sysfs interface. So, whenever an AF_XDP socket is created, an entry gets created in this

16:54.280 --> 16:59.720
interface under /sys/kernel/flash. For example, when socket 1 is created, a 1 gets created; for socket 2, a

16:59.720 --> 17:04.120
2 gets created. And now, to set a redirection rule saying, I want to send packets

17:04.280 --> 17:11.640
from 1 to 2, you can just do an echo 2 > /sys/kernel/flash/1/next. So, internally, whenever

17:11.640 --> 17:17.480
these packets are getting transmitted, the map will be looked up: okay,

17:17.480 --> 17:23.160
these packets need to go to AF_XDP socket 2; it will get this structure that is stored

17:23.160 --> 17:27.880
in the kernel, get access to the rings, and then do this magic that is happening. So,

17:27.960 --> 17:34.200
that is how we are doing it. So, now, the point is: for all the things that we have done,

17:34.200 --> 17:40.200
how much change do we need to make? Is it drastic? Does it change the default

17:40.200 --> 17:45.640
path of AF_XDP? The answer is no; we are not changing anything, we are only augmenting

17:45.640 --> 17:49.800
whatever is there in the AF_XDP system. It is around 700 lines of code that we have added,

17:49.800 --> 17:55.240
most of it for the sysfs interface. And in the driver also, it requires only minimal

17:55.320 --> 18:00.680
patching, because now, when the driver checks if there are packets

18:00.680 --> 18:04.520
ready for transmission, you need to tell the driver, hey, there are packets ready, but you

18:04.520 --> 18:08.520
do not need to transmit them; you just do your own stuff. So, the new lines that you have to add

18:08.520 --> 18:14.920
are only those one or two. So, if there is no transmission, it can just return: I have done some work,

18:14.920 --> 18:19.800
NAPI should work as usual, but do not do anything, do not DMA a packet out. So, this

18:20.520 --> 18:26.840
code is needed for supporting different drivers; we have tested our implementation on

18:28.440 --> 18:31.080
four different drivers, and it was working fine.

18:32.520 --> 18:39.640
And yeah, the performance gains: if we look at it, the x-axis shows the number of applications

18:39.640 --> 18:45.240
that we chain together in a zero-copy manner, and the y-axis shows the throughput as well as the

18:45.240 --> 18:50.280
latency. As you can see, if we use the standard SR-IOV way, the performance decreases, because of

18:50.280 --> 18:55.960
the copies and different stuff. Whereas, if we use our FLASH technique, then it stays almost

18:55.960 --> 19:01.480
the same. And if we compare it with other user-space chaining techniques, like OpenNetVM,

19:01.480 --> 19:08.360
the throughput is almost similar, whereas the latency is

19:08.360 --> 19:13.080
really low compared to the latency that we are getting out of the

19:13.080 --> 19:25.240
OpenNetVM solution. And yeah, in summary, I have shown you that AF_XDP has a problem: when you

19:25.240 --> 19:30.200
try to have multiple AF_XDP applications colocated on the same machine, you cannot send packets from

19:30.200 --> 19:35.720
one socket to the other socket. So, to solve the problem, we have extended the subsystem so that

19:35.720 --> 19:41.000
redirection can happen. For that, we have updated the rings to support multi-producer multi-consumer,

19:41.000 --> 19:47.400
implemented batching, buffering, and smart polling for efficient packet transmission, and a

19:47.400 --> 19:54.360
sysfs interface to set up these redirection rules. Now, talking about what we can do with

19:54.360 --> 19:58.920
this technology. So, the first thing is that we now have this sysfs

19:58.920 --> 20:05.720
interface; I think, if you think about it programmatically, it is quite a bit strict. You can say

20:05.720 --> 20:10.040
that you send packets from one to another. If you want, you can have

20:11.320 --> 20:16.040
application one chain to one application as well as another application; you can do that,

20:16.040 --> 20:22.120
but the application now internally has to say to the kernel: I want to send this packet

20:22.120 --> 20:27.960
to something else. So, what you would like is that the application sends the packet as usual,

20:27.960 --> 20:32.680
and whenever it comes to the kernel, the kernel can automatically redirect the packet wherever you want.

20:32.680 --> 20:38.280
So, the answer will obviously be an eBPF program, right. If there is an eBPF program at the

20:38.360 --> 20:45.400
egress point, then you can decide this rule, right: for this particular packet that is coming,

20:45.400 --> 20:52.360
instead of sending the packet out through the NIC, I want to send this packet to another application,

20:52.360 --> 20:57.480
or you can say, okay, I want to send it to application C. So, that logic you can implement.

20:57.480 --> 21:01.880
There has been an interesting talk about implementing XDP egress; there was a talk at

21:02.680 --> 21:09.800
LPC. I think this fits really well with that idea, where they were trying to see what observability

21:09.800 --> 21:14.680
and other things they can do with it, right. Now, we can have this XDP egress hook point where

21:14.680 --> 21:19.480
this decision of sending the packet to the next NF can also be incorporated, right. So,

21:19.480 --> 21:26.680
I think it is really great. So, we have also tested this chaining solution with different applications,

21:26.680 --> 21:32.680
like load balancers, firewalls, and other stuff, right. I think generalizing it to other

21:32.680 --> 21:38.200
such applications can also be interesting, right. For example, you can have a DDoS mitigation

21:38.200 --> 21:43.560
application. So, nowadays, iptables is used. So, maybe I will replace iptables with some

21:43.560 --> 21:48.440
zero-copy application that is actually doing the routing, and then I will have this DDoS protection.

21:48.440 --> 21:54.280
Maybe have some key-value store or web server running with this entire stack, and I can redirect

21:54.360 --> 21:58.840
packets whenever they are going through this stack, through this bypass stack, right.

21:58.840 --> 22:03.160
So, that can be very interesting. So, I would really like to hear from you what you think

22:03.160 --> 22:09.320
about this implementation. We are also currently developing zero-copy support with TCP, with

22:09.320 --> 22:15.320
mTCP, the user-space TCP stack, so that we can also deploy TCP applications over this thing we

22:15.320 --> 22:20.680
are working on. And I think, since we have updated the rings of AF_XDP to support multiple

22:20.680 --> 22:25.880
producers and multiple consumers, now there can be different scenarios that people can

22:25.880 --> 22:32.040
think of. For example, whenever a packet has come to, say, queue 1, in the XDP program,

22:32.600 --> 22:37.000
it can currently only deliver the packet to whatever AF_XDP socket it is bound to, right. If it is

22:37.000 --> 22:41.400
bound to queue 1, it can go there only, because the rings previously only supported

22:41.400 --> 22:45.080
single-producer single-consumer. So, now you have multiple producers, so you can say, okay,

22:45.080 --> 22:49.720
from queue 1, I want to send to queue 2; those kinds of things also open up because of this implementation.

22:49.720 --> 22:56.360
So, I am also really excited about what we can do using these single-producer

22:56.360 --> 23:02.920
multi-consumer and multi-producer single-consumer rings in AF_XDP. And also, I talked

23:02.920 --> 23:07.320
about the flash monitor a little bit; we are also working on rewriting the flash

23:07.320 --> 23:11.720
monitor as well. It was written in C; we are rewriting it completely in Rust right now.

23:13.160 --> 23:15.160
Thank you. Any questions?

23:19.800 --> 23:33.160
Okay. Great work. I have two questions. The first one: how do you deal with the

23:33.160 --> 23:37.720
redirection, given that the AF_XDP rings can be of different sizes, right? And then you could

23:37.720 --> 23:44.760
potentially have one process redirecting to the other one block, right. Have you thought about

23:45.320 --> 23:49.560
that? Sorry, what do you mean? I mean, the AF_XDP rings could be of different sizes

23:49.560 --> 23:54.520
for different sockets, right. And if you redirect from multiple other sockets to a certain socket,

23:54.520 --> 23:59.160
it could fully block the receiving socket. So, have you thought about that?

24:00.200 --> 24:05.720
So, what we do is that internally, when you create redirections, for example,

24:06.920 --> 24:13.080
A wants to redirect packets to B: in the kernel, we create some temporary buffers,

24:13.880 --> 24:19.880
which are the size of socket 1's ring. And when you try to redirect, maybe you are trying to

24:19.880 --> 24:24.520
redirect the whole thing; it is going to try to enqueue all those packets. If it is not

24:24.520 --> 24:31.160
possible, then it will ensure that the other packets stay there. It will not block until this is done.

24:34.040 --> 24:41.880
And then my other question was: have you measured the overhead for a regular AF_XDP

24:41.880 --> 24:48.360
application with the new multi-producer multi-consumer rings? Like, how big is the overhead? And you mentioned

24:48.360 --> 24:54.680
in your talk that you regained the performance because of batching. Yeah. So, can you give

24:54.680 --> 24:59.480
some more details on that? Yeah, yeah. Actually, we did test it with our

25:00.200 --> 25:05.560
multi-producer multi-consumer implementation; we have done some experiments. The work actually

25:05.560 --> 25:10.520
has been published in SCC. In the paper, we have the results comparing it. And it was not much.

25:10.600 --> 25:15.080
The latency... I think I don't remember the exact numbers, but it is not

25:15.080 --> 25:19.640
drastic; it is almost similar. And it was because of the batching that we were able to

25:19.640 --> 25:29.720
have the same numbers as the normal implementation. Yeah. There is a question.

25:32.520 --> 25:37.320
So, is there already a patch set for this? Are you going to upstream this to the

25:37.320 --> 25:43.560
kernel, or is it already in? Because I know, yeah, it is open source. So, we have two components

25:43.560 --> 25:47.400
that we developed. One is the kernel changes that we have done. The other thing that we have

25:47.400 --> 25:52.600
done is a whole user-space library where people can go and build their applications.

25:53.160 --> 25:57.720
That is a big thing that I didn't discuss at all in this talk. Both of them are open

25:57.720 --> 26:04.600
source. We are, I will say, at the point of talking with the

26:04.680 --> 26:09.480
people who are working in this field to see what they think about this, do they think that

26:09.480 --> 26:14.200
this is very important and should go into the kernel. So, I think coming up with applications

26:14.200 --> 26:18.760
that can benefit from this technology is also a very interesting thing that I think we can

26:18.760 --> 26:21.800
think of. Amazing. Thank you. Yeah. Thank you.

26:25.800 --> 26:31.240
Have you considered just sending an RFC patch set to the kernel mailing list, just to get some

26:31.320 --> 26:42.440
feedback as well? Yeah, I have been thinking about that. I wondered how you handle the performance

26:42.440 --> 26:51.080
when you go cross-CPU? Because you lose a lot of performance when you do

26:51.640 --> 26:59.240
things across CPUs. Yeah, definitely. I think because of the batching,

26:59.240 --> 27:04.200
if you look at the numbers, right, there is a little bit of a decrease when you go from

27:04.200 --> 27:10.360
like one to two. It's not much, but there is a decrease. So, I think because of the way that we

27:10.360 --> 27:15.960
are doing the batching, the difference was not that much. I didn't discuss batching

27:15.960 --> 27:20.280
a lot here, but yeah, I think because of the batching, the difference was

27:21.960 --> 27:26.920
small; we were able to catch up, even when we were doing compare-and-swap

27:26.920 --> 27:32.040
operations across CPUs. Yeah. Any more questions?

27:39.800 --> 27:43.880
What is the name of the platform that you didn't discuss so far? Because I'm actually interested.

27:43.880 --> 27:48.920
The name of the...? No, not the flash one, but the other one. Sorry? The one you

27:49.480 --> 27:53.400
use to develop the applications on top of it. Was that also open source?

27:53.400 --> 28:00.040
Yeah. So, the user-space thing is named flash only, and the kernel-space part, I think we named it flash

28:00.040 --> 28:06.040
Linux kernel, because we wanted to get it pushed into the Linux kernel. So, we didn't name anything

28:06.920 --> 28:14.760
interesting for that. Thank you. Right. Thank you. Yeah.

