WEBVTT

00:00.000 --> 00:15.960
Okay. Welcome, everyone. I'm back. I'm Stefano Garzarella. I work in the Red Hat virtualization team.

00:15.960 --> 00:21.960
And in this talk, I would like to go through all the ways we currently have in Linux

00:21.960 --> 00:28.640
to emulate virtio devices. This talk was mainly inspired by Thomas, because last

00:28.640 --> 00:34.840
year I talked a bit about this in another talk here at FOSDEM, about vhost-user.

00:34.840 --> 00:41.480
And Thomas liked the comparison between all the ways we have to emulate virtio devices.

00:41.480 --> 00:50.600
So I decided to go a bit deeper and show the pros and cons of each approach we have in Linux.

00:50.600 --> 00:57.160
So before going into the details, let's start with what virtio is. Maybe most of you

00:57.160 --> 01:02.680
already know everything about it, but essentially it is an open specification. The current

01:02.680 --> 01:11.360
version is 1.3, and the 1.4 is out for public review. And what is written in the

01:11.360 --> 01:18.040
spec is that the purpose of virtio is to provide a straightforward, efficient, standard, and extensible mechanism

01:18.040 --> 01:25.640
for virtual devices. So essentially, it defines a specification for virtio devices.

01:25.640 --> 01:32.280
What does the spec cover? Essentially, in the first part, it defines the core

01:32.280 --> 01:38.640
components that every device uses, for example for feature negotiation and that kind

01:38.640 --> 01:46.080
of things. Then there is a part that defines the initialization phase

01:46.080 --> 01:52.480
and the setup. It defines all the transports that virtio devices support. For now, we have

01:52.560 --> 01:58.020
PCI, MMIO, and Channel I/O. And then there is a big section for each device,

01:58.020 --> 02:05.600
so net, block, and so on. About the core components, we can say the three main ones

02:05.600 --> 02:13.980
are the control path, the data path, and a way to exchange notifications. The control path

02:13.980 --> 02:22.180
is mainly used for feature negotiation, so driver and device essentially agree on which features

02:22.220 --> 02:27.060
they need to use. For example, we can have a new device with an old driver that

02:27.060 --> 02:31.580
is not able to use a new feature, and also vice versa: there can be

02:31.580 --> 02:37.700
a new driver and an old device. So they do this feature negotiation to agree

02:37.700 --> 02:43.060
on what to use. Then there is a configuration space, which is used, for example, in a

02:43.060 --> 02:48.260
network card to set up the MAC address; in a vsock device, the device will tell

02:48.260 --> 02:54.260
the guest: this is your CID, and that kind of thing. And then the control

02:54.260 --> 03:00.260
part is used to set up the data path. Most of the memory is always allocated

03:00.260 --> 03:07.380
by the driver, by the guest, including the data path. So the driver

03:07.380 --> 03:14.540
needs to tell the device: here is the virtqueue. And the virtqueue is essentially

03:14.620 --> 03:22.700
the main part of the data path. In the picture, I drew the split virtqueue. And in the

03:22.700 --> 03:28.460
spec, there is also a newer version, which is called the packed virtqueue. About the split one: essentially

03:28.460 --> 03:35.980
we have three memory areas. One is the descriptor table, which mainly contains pointers to buffers,

03:37.340 --> 03:42.860
with the length of the buffer, of course, flags, and that kind of thing. And then we have two rings.

03:43.420 --> 03:48.460
One is the available ring, and the other one is the used ring. The available one is used

03:48.460 --> 03:56.540
by the driver essentially to expose a new buffer to the device, so for example a new packet.

03:57.100 --> 04:04.780
And the used one, instead, is used by the device to tell the driver: hey, I used this

04:04.780 --> 04:12.060
buffer, now you can take it back and clean it or reuse it or whatever. And in the packed virtqueue,

04:12.060 --> 04:17.260
everything is packed into the descriptor table, mainly for hardware, because as we will see,

04:17.820 --> 04:24.140
even hardware is now able to expose virtio devices. And having three different memory regions

04:25.420 --> 04:32.060
can produce a lot of memory accesses, a lot of PCI transactions. So in order to reduce that

04:32.060 --> 04:39.340
overhead, they defined a new format for the virtqueue, where everything is packed into the descriptor table.

04:39.420 --> 04:44.700
It's a bit more complicated, but it can work much, much better with hardware.

04:45.660 --> 04:53.340
And about notifications: we call "kick" the notification from the guest to the host,

04:53.340 --> 04:57.340
so from the driver to the device, and "call", of course, the one going the other way around.

04:59.660 --> 05:04.300
The devices: we have a lot of devices defined in the spec. It's an open spec, so if you

05:04.620 --> 05:11.740
want to define a new device, you can simply send a patch series to define a new one.

05:12.380 --> 05:17.580
The main ones: we have a network card, virtio-net; a block device, virtio-blk;

05:17.580 --> 05:24.460
vsock, the virtual socket. The first talk today talked a lot about vsock; we have virtio-vsock,

05:24.460 --> 05:32.780
which supports the AF_VSOCK address family. We have virtio-fs for shared folders, and others.

05:35.260 --> 05:41.900
Okay, let's get into the topic of the talk. This is just an overview; then we will go

05:41.900 --> 05:52.380
through each approach. So the first, most common way to do device emulation is to do it inside the VMM,

05:52.380 --> 05:59.900
so inside QEMU. This is kind of the standard way, and one of the key roles of a VMM is exactly

05:59.900 --> 06:04.700
to do device emulation. And QEMU, in our case, provides a lot of virtio devices.

06:06.060 --> 06:12.380
Then another way we have is to move the device emulation into an external userspace process.

06:12.940 --> 06:18.860
The main purpose is security and reliability, but as we discussed in the Rust VMM

06:18.860 --> 06:27.020
talk, it is also about flexibility: we can write the device in a different language, among other reasons. In Linux,

06:27.340 --> 06:33.980
for now, we have two ways: vhost-user, which is the one I mentioned before from the Rust VMM talk,

06:33.980 --> 06:37.740
and VDUSE, which is vDPA in userspace. We will talk a bit more about that later.

06:39.020 --> 06:45.180
Then another way to emulate virtio devices is in the host kernel.

06:45.180 --> 06:51.980
And this is mainly for performance, because being inside the kernel, we save a lot of

06:51.980 --> 06:59.500
system calls. Currently we have two ways: one is the vhost devices, and the other is

06:59.500 --> 07:05.260
a vDPA software device. The vDPA one is only potential: for now, we don't have any

07:05.260 --> 07:10.700
vDPA software device implemented in the kernel, but we did some proofs of concept, so it's kind of

07:11.500 --> 07:20.460
doable. And the last one is to move the device completely into the hardware, and this was the main reason

07:20.460 --> 07:25.980
for vDPA: again, performance, scalability, and that kind of thing. So, going back to the

07:25.980 --> 07:33.260
first way, the virtio device emulated by the VMM. As I said, we can say that

07:33.260 --> 07:40.540
QEMU is the de facto reference implementation of virtio devices. You can find a lot of them,

07:40.540 --> 07:47.580
not all of them, but most of them, implemented in QEMU. And why do it this way? Mainly simplicity:

07:47.740 --> 07:53.980
as I mentioned, one of the key roles of a VMM is device emulation. So

07:53.980 --> 08:00.460
you have all the APIs to access the guest memory, to inject interrupts, and that kind of thing.

08:00.460 --> 08:06.380
And then there's portability: if you write a device in QEMU, essentially you can run the device

08:06.380 --> 08:15.020
on all the operating systems that QEMU supports. Drawbacks are a reliability risk: your device

08:15.100 --> 08:22.860
is in QEMU, so a potential bug in your device can affect QEMU as a whole, and your VM can go down.

08:24.140 --> 08:31.900
Another is the performance overhead. So, for example, if you have a network

08:33.020 --> 08:41.100
card and the driver needs to send a packet: essentially the virtio-net driver will copy the

08:41.100 --> 08:49.660
packet into the virtqueue, then send a kick. So, on the VM exit, KVM will write into the ioeventfd

08:49.660 --> 08:56.620
to wake up QEMU or the I/O thread. And then the device running in QEMU will access

08:56.620 --> 09:03.980
the virtqueue, pick up the packet, and maybe send it to a TAP device, so it needs to do a write. So,

09:03.980 --> 09:11.260
again, a system call: go into the kernel, send the packet, then return from the system call.

09:11.260 --> 09:18.060
When the packet is sent, the device implementation in QEMU

09:18.060 --> 09:24.460
will put the descriptor into the used ring, and then it needs to inject an interrupt. So, again,

09:24.460 --> 09:30.140
another system call to write into the irqfd, to tell KVM: hey, inject an interrupt for this device.

09:30.220 --> 09:35.900
So there are a couple of system calls for each packet that the device needs to handle.

09:38.780 --> 09:45.660
So, how to use a virtio device in QEMU? This is a simple example, where essentially we have two

09:45.660 --> 09:53.820
-blockdev options, which are part of the QEMU block layer, and we are telling the

09:53.820 --> 10:02.220
QEMU block layer: open the Fedora qcow2 image, and use the qcow2 format on top. And then we can link

10:02.220 --> 10:11.260
this qcow2 image to a device. In this case, it is a virtio-blk PCI device, but we could even use

10:11.260 --> 10:18.460
NVMe emulation, or virtio-scsi, or any other kind of block device that QEMU emulates.
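
NOTE
The invocation being described is roughly the following; this is a minimal sketch, and the image path and node names are illustrative, not taken from the slide:

```shell
qemu-system-x86_64 \
  -blockdev file,filename=Fedora.qcow2,node-name=file0 \
  -blockdev qcow2,file=file0,node-name=disk0 \
  -device virtio-blk-pci,drive=disk0
```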

10:18.460 --> 10:27.500
So we just need to change the -device option here. Vhost: okay, this was initially introduced

10:27.500 --> 10:34.140
exactly to improve the performance of the virtio-net device. As I mentioned, when you need to send a

10:34.140 --> 10:40.860
packet, you need to do several syscalls. So what they did, essentially, is move the device emulation

10:40.860 --> 10:48.060
into the kernel. The control path is still intercepted by QEMU, and the kernel exposes

10:48.060 --> 10:56.700
a couple of ioctls that the QEMU vhost device can use to essentially offload to the kernel

10:56.700 --> 11:03.980
the device emulation: mainly, where the virtqueues are and how to access them. And at this point,

11:03.980 --> 11:11.180
the data path will be completely handled by the device in the kernel. Currently, in Linux,

11:11.180 --> 11:18.380
we support three devices: vhost-net, vhost-scsi, and vhost-vsock. Why? Mainly performance, as I mentioned:

11:18.380 --> 11:24.860
having the device in the kernel, for example to talk with the TAP device, we don't need to do syscalls.

11:24.860 --> 11:29.500
So we save those kinds of syscalls, reduce latency, and improve the throughput.

11:30.220 --> 11:37.180
But for other use cases like vsock, the other reason is that being in the kernel,

11:37.820 --> 11:44.060
it's much easier to interact with the kernel layers, like the socket layer. So, in the case of vsock,

11:44.060 --> 11:51.260
we need to open a socket in the kernel, and having the endpoint of the connection

11:51.260 --> 11:59.340
in the kernel makes it much easier to instantiate a new socket for AF_VSOCK.

12:00.060 --> 12:05.980
Drawbacks: yeah, of course, we want to go fast, so we need to pay something,

12:05.980 --> 12:09.580
and there is a security and reliability risk. Essentially,

12:10.540 --> 12:17.980
a bug in the device that, in QEMU, would crash your VM can here potentially crash your entire host,

12:17.980 --> 12:23.900
and for security, again, there is an attack surface in the host kernel. So, yeah, we need to be careful.

12:24.540 --> 12:31.260
Upgradability: if you want to provide a new feature, you need to update your host kernel.

12:33.500 --> 12:38.220
Portability: this is Linux only; it's only implemented in the Linux kernel.

12:39.020 --> 12:43.580
How do you use it? It really depends on the device. For vsock, it's very, very easy:

12:43.580 --> 12:50.220
all the complexity is hidden by QEMU, so you just need to add a vhost-vsock device

12:50.860 --> 12:54.300
and specify the guest CID; then QEMU will set up everything,

12:54.300 --> 13:00.620
will create the device for the guest, and then do the ioctls to offload the emulation to the kernel.
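
NOTE
A sketch of the vhost-vsock setup just described; the CID value is illustrative:

```shell
# QEMU opens the in-kernel vhost-vsock device and offloads the
# data path to it via ioctls; we only pick the guest CID
qemu-system-x86_64 \
  -device vhost-vsock-pci,guest-cid=3
```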

13:01.900 --> 13:05.820
Vhost-user: I already talked a bit about that in the previous talk

13:05.820 --> 13:12.140
on Rust VMM, but essentially, vhost-user was inspired by vhost. Instead of having the

13:12.140 --> 13:18.300
device in the kernel, the device now moves to another process, a userspace process

13:19.100 --> 13:23.260
completely separated from the VMM and from the host kernel.

13:23.980 --> 13:29.900
The control path here is a Unix domain socket, because we need to move some file descriptors

13:29.900 --> 13:36.940
around, mainly for the shared memory: QEMU needs to allocate the guest memory in such a way

13:36.940 --> 13:41.820
that the memory can be addressed by a file descriptor, and then pass the file descriptor to the device

13:41.820 --> 13:47.980
using the Unix domain socket. And the data path at that point will be handled

13:47.980 --> 13:54.140
completely by the device, accessing the guest memory directly. So why do that?

13:55.340 --> 14:00.140
Yeah, essentially to mitigate the security and reliability risks of vhost, because now the device runs

14:00.140 --> 14:07.820
completely in another userspace process. That process can be put in a sandbox, for example,

14:07.820 --> 14:14.700
to avoid any escape, and any crash takes down just that process; we can easily restart

14:14.700 --> 14:22.060
a new process doing the emulation. Another advantage is upgradability: if you

14:22.060 --> 14:27.660
want to update your device, you just need to run a new process and potentially just re-plug the device

14:27.660 --> 14:37.100
into your VM. And flexibility: we talked about it in the Rust VMM talk. With vhost-user,

14:37.100 --> 14:41.900
we can essentially write the device in another language; in Rust VMM we wrote

14:41.900 --> 14:49.580
them in Rust and used them with QEMU pretty easily. Drawbacks: a bit more resource usage,

14:50.300 --> 14:56.460
because of course we have extra processes; all the devices will run in other processes, so a bit more

14:57.260 --> 15:06.300
resource consumption. And more coordination: we need to start all the device processes first,

15:06.300 --> 15:11.260
then start QEMU. So there is a bit more coordination, but this can be hidden by a

15:11.260 --> 15:19.180
management layer like libvirt, which is used essentially for that. Portability: I talked about

15:19.180 --> 15:25.260
that last year; it was of course written on Linux, but it's not really Linux specific. Any

15:25.260 --> 15:32.460
POSIX system can run vhost-user, because the only things we need are a Unix domain socket

15:32.460 --> 15:38.460
and a way to allocate shared memory addressed by a file descriptor, and in POSIX we have

15:38.540 --> 15:47.180
the shm_open syscall that can be used. How to run it? Okay, it's a bit more complicated.

15:47.180 --> 15:54.940
In this case, I use as an example the QEMU storage daemon, which is a standalone application

15:54.940 --> 16:01.100
that provides exactly the same block layer as QEMU, but can then expose the block

16:01.100 --> 16:07.260
device in several ways, and one way is vhost-user-blk. These two lines are exactly the same

16:07.260 --> 16:13.260
as I used in the first example, so I'm telling the QEMU storage daemon block layer, which is

16:13.260 --> 16:23.740
the same as QEMU's: hey, open the Fedora qcow2 image, use the qcow2 format on top, and then export

16:23.740 --> 16:30.620
this image as a vhost-user-blk device on this Unix domain socket, on a socket under /tmp.

16:30.700 --> 16:38.060
And then we can start QEMU, open the Unix domain socket, and attach it to the device,

16:38.060 --> 16:44.300
to the vhost-user-blk PCI device. One thing we need to use is a memory backend,

16:44.300 --> 16:48.380
a memory backend which is shareable. The memory backend essentially is the

17:00.700 --> 17:06.780
way QEMU allocates the guest memory. If you don't specify it,

16:55.500 --> 17:00.700
the default is memory-backend-ram, I think, or something like that, but essentially QEMU

17:00.700 --> 17:06.780
will just do a malloc, which is not okay for vhost-user, so we need a shareable

17:07.580 --> 17:14.060
memory backend. The memfd one is shareable by default, because the memory

17:14.060 --> 17:19.420
is essentially addressed by a file descriptor, and we can share the file descriptor over the Unix domain socket.

17:19.820 --> 17:23.660
This is Linux only, so if you want something portable, you can use the

17:24.300 --> 17:32.700
memory-backend-shm that we recently introduced in QEMU 9.1 last year.
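
NOTE
A sketch of the two invocations described above; the image path, socket path, and memory size are illustrative, not taken from the slide:

```shell
# Export the qcow2 image over vhost-user-blk on a Unix domain socket
qemu-storage-daemon \
  --blockdev file,filename=Fedora.qcow2,node-name=file0 \
  --blockdev qcow2,file=file0,node-name=disk0 \
  --export vhost-user-blk,id=export0,node-name=disk0,addr.type=unix,addr.path=/tmp/vhost-user-blk.sock

# Start QEMU with shareable guest memory and attach the device
qemu-system-x86_64 \
  -object memory-backend-memfd,id=mem0,size=2G,share=on \
  -machine memory-backend=mem0 \
  -chardev socket,id=char0,path=/tmp/vhost-user-blk.sock \
  -device vhost-user-blk-pci,chardev=char0
```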

17:34.780 --> 17:42.140
Okay, the last one: vDPA, which stands for virtio data path acceleration. vDPA was mainly

17:42.140 --> 17:47.500
introduced for hardware, but as we will see, we can also do software devices with vDPA.

17:48.460 --> 17:54.700
In the case of vDPA, the data path must be virtio compliant, so the device needs to understand

17:54.700 --> 18:02.540
the virtqueues and how to process them. The control path, instead, can be

18:02.540 --> 18:08.300
vendor specific, so in the case of a hardware device, we need a small driver in the kernel

18:08.300 --> 18:15.020
for the control path. And why? Mainly, again, performance: we are offloading the entire data path

18:15.660 --> 18:21.500
to the hardware. One example is SmartNICs: some of them are now able to expose

18:21.500 --> 18:28.860
virtio devices, so we save several CPU cycles on the emulation, because the

18:28.860 --> 18:34.780
device will be directly exposed to the guest, and the virtqueues shared directly from the guest

18:34.780 --> 18:44.860
to the hardware. Yeah, some of them are implemented with Arm

18:45.740 --> 18:52.380
chips, and they need to be able to understand the virtqueue, so the driver essentially tells

18:52.380 --> 18:57.820
the hardware: hey, here is a virtqueue, please use it for your device to do the emulation.

18:59.100 --> 19:03.820
It's not very common; you need to buy the hardware. I have a slide on the drivers later, but yeah, it's not

19:03.820 --> 19:11.900
really common. Another advantage is that with this we support both VM

19:12.220 --> 19:20.300
workloads and even containers, because with vDPA we have the concept of a vDPA bus in the kernel,

19:20.300 --> 19:26.060
where we have the vhost-vdpa bus: if you attach the device to the vhost-vdpa bus, the device will

19:26.060 --> 19:35.020
be exposed in a similar way to vhost, the second approach we saw, so QEMU needs to use

19:35.420 --> 19:40.940
ioctls to set up the device, in a very similar way. But we also have another bus,

19:40.940 --> 19:48.460
which is the virtio-vdpa bus: if you attach your device to that bus, the virtio driver will be loaded

19:48.460 --> 19:53.580
in the host kernel, so the virtqueues will be allocated by the host, and you can attach

19:53.580 --> 19:59.420
your device directly to your host and use it for bare-metal applications or containers.
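
NOTE
A sketch of the two buses using the in-kernel vdpa_sim_net simulator (this requires root and the relevant kernel modules; the device name is illustrative):

```shell
# Load the vDPA core and the net simulator (a software vDPA device)
modprobe vdpa
modprobe vdpa_sim_net

# Create a vDPA device instance on the simulator's management device
vdpa dev add name vdpa0 mgmtdev vdpasim_net

# With vhost_vdpa loaded, a /dev/vhost-vdpa-* chardev appears for a VMM to use;
# with virtio_vdpa loaded instead, the host's own virtio-net driver binds to it
modprobe vhost_vdpa
ls /dev/vhost-vdpa-0
```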

19:59.660 --> 20:10.460
Another advantage is the unified software stack of vDPA. As I mentioned, it was introduced

20:10.460 --> 20:17.260
for hardware, but we can even write userspace devices with VDUSE, which is similar to vhost-

20:17.260 --> 20:26.220
user, but requires some coordination, some modules in the kernel. And we can also have the

20:26.220 --> 20:32.220
devices in the kernel itself, like vhost; as I mentioned, for now we don't have any device

20:32.220 --> 20:39.420
like that, but we have VDUSE; I have an example later about VDUSE. The advantage of vDPA is that

20:39.420 --> 20:45.660
all of these three ways to implement a vDPA device are exposed in exactly the same

20:45.660 --> 20:52.540
way to QEMU, so QEMU only needs to support vhost-vdpa, and from QEMU's point of view it doesn't

20:52.620 --> 20:59.580
really matter how the device is implemented. Drawbacks: portability, this is Linux only, since you need

20:59.580 --> 21:05.900
support in the Linux kernel; it's not like vhost-user, which is completely agnostic, so it's

21:05.900 --> 21:13.340
really Linux specific. And about maturity: it's still a work in progress, so it's supported by few hardware vendors,

21:13.340 --> 21:22.540
not a lot; for now I have seen just net and block supported as virtio

21:22.540 --> 21:28.700
devices. And of course the cost: if you want the

21:28.700 --> 21:34.940
device in hardware, you need to pay for the hardware, of course. And this is an example, again

21:34.940 --> 21:42.220
using the QEMU storage daemon, because the QEMU storage daemon can also expose VDUSE devices.

21:42.220 --> 21:47.740
The first two lines again are exactly the same, but we are telling the QEMU storage daemon:

21:47.740 --> 21:54.700
expose the device as a VDUSE block device with a name. Then we can use the vdpa tool to

21:54.700 --> 22:01.020
actually instantiate the device, and at that point we will find a sysfs entry that tells us:

22:01.020 --> 22:09.260
hey, this is vhost-vdpa-0. And so in QEMU we can just use the vhost-vdpa-device-pci with the

22:09.260 --> 22:26.940
character device, where QEMU can do the ioctls to set up the virtqueues. And yeah, that's it, thank you. Any questions?
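
NOTE
A sketch of the VDUSE flow just described; the image path and device names are illustrative, not taken from the slide:

```shell
# Export the image as a VDUSE block device named vduse0
qemu-storage-daemon \
  --blockdev file,filename=Fedora.qcow2,node-name=file0 \
  --blockdev qcow2,file=file0,node-name=disk0 \
  --export vduse-blk,id=export0,node-name=disk0,name=vduse0

# Instantiate the vDPA device on the VDUSE management device
vdpa dev add name vduse0 mgmtdev vduse

# Attach it to the guest through the vhost-vdpa character device
qemu-system-x86_64 \
  -device vhost-vdpa-device-pci,vhostdev=/dev/vhost-vdpa-0
```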

22:40.220 --> 22:46.060
The question was about whether we can use eBPF to do the device emulation in the

22:46.060 --> 22:52.060
kernel. It could be an idea; currently we don't have any in-kernel vDPA device

22:52.060 --> 22:55.900
implemented, but this could potentially be one way to do that.

22:55.900 --> 23:13.180
The question was why not just pass the whole PCI device to the guest. I mean, you can do that;

23:13.180 --> 23:19.420
the main difference is that in that case your entire PCI device will be

23:19.420 --> 23:26.060
assigned to the guest, so you need to support all the PCI stuff. Here the device can be

23:26.060 --> 23:32.780
emulated, it can be implemented in any way; the only thing you need is that

23:32.780 --> 23:41.820
your device can implement several of these virtual functions, and the advantage here is that

23:41.900 --> 23:49.580
it exposes a virtio device, so you don't need to change anything

23:49.580 --> 23:55.500
in the guest, because the guest hopefully already has a virtio

23:55.500 --> 24:01.980
driver and can just use it, so you don't need a vendor driver for your PCI device in the guest.

24:25.500 --> 24:41.340
Yeah, the question was: vDPA has been out for several years already, so why is it not so popular now?

24:41.340 --> 24:45.980
I have no idea, honestly, but maybe because the hardware vendors want

24:45.980 --> 24:53.500
lock-in, maybe, I don't know, or maybe there is some kind of bottleneck when using it. I don't

24:53.580 --> 25:00.300
know; I'm not working on vDPA. But sorry, we don't have time anymore; if you have any

25:00.300 --> 25:05.980
questions, I will be outside. Thank you.

