WEBVTT

00:00.000 --> 00:24.720
So, I'm Magnus. I'm working in Azure Core on the upstream team. In the upstream team,

00:24.720 --> 00:37.480
we've worked mostly on container-related open source technologies like Kubernetes, OCI, etc.

00:37.480 --> 00:46.640
But during the last year, we also worked on QEMU. I want to talk a bit about the work that

00:47.640 --> 00:56.640
we did in this project. There's an audio issue.

00:56.640 --> 00:59.640
I'll just try to increase the volume.

00:59.640 --> 01:12.640
I can also try to speak louder, but that's fine. One, two, three. One, two, three.

01:13.640 --> 01:23.640
It's better. Okay. So, yeah. First, maybe it makes sense to talk a bit about what QEMU

01:23.640 --> 01:31.640
accelerators are. So, this is obviously a simplification, but I think it still makes sense to

01:31.640 --> 01:40.640
dumb down things a bit to provide context on what this feature is about. So, an accelerator is

01:40.640 --> 01:50.640
essentially the term in QEMU that describes the interface for handing off work to the

01:50.640 --> 02:00.640
virtualization extensions in the CPU. And we've seen this chart — if you look at the VM,

02:00.640 --> 02:07.640
it's not really easy to describe what the VM actually is, but we can maybe use this as a kind of working

02:07.640 --> 02:16.640
definition: there's a bunch of CPUs, as indicated here by the multiple of those entities,

02:16.640 --> 02:26.640
and there's memory. And from the host, we map memory into the guest. We kind of project

02:26.640 --> 02:36.640
it as guest memory; we contribute it to the guest. And we also assign certain holes to it. And

02:36.640 --> 02:46.640
the holes are then caught as memory-access traps by the hypervisor, with help from the CPU. And then the VMM

02:46.640 --> 02:54.640
can, for example, do device emulation. So, there are more variants of I/O, but this is

02:54.640 --> 03:01.640
maybe the most common one. And you see it's like the actual work in the VMM is actually

03:01.640 --> 03:10.640
this cycle. So, you pass off work to the CPU, then it's handed off to the hypervisor,

03:10.640 --> 03:17.640
which dispatches it to the actual CPU, and under certain circumstances, then we'll end up —

03:17.640 --> 03:25.640
I'm simplifying things a lot — we end up getting a trap in the guest that the guest itself cannot handle

03:25.640 --> 03:37.640
and we can yield back to the VMM and the cycle continues. Also relevant is maybe a quick discussion
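
The run loop just described can be sketched as a toy model — all names here are illustrative stand-ins, not the actual QEMU, KVM, or MSHV API: the vCPU handles accesses to mapped memory by itself, and every access to a "hole" traps out and yields back to the VMM for device emulation, after which the cycle continues.

```python
from dataclasses import dataclass, field

@dataclass
class ToyVm:
    ram: dict = field(default_factory=dict)       # mapped guest pages
    mmio_log: list = field(default_factory=list)  # accesses emulated by the VMM

    def vcpu_run(self, accesses):
        """Toy run loop: process guest memory accesses until all are handled."""
        for addr, value in accesses:
            if addr in self.ram:
                self.ram[addr] = value            # backed by memory: no VM exit
            else:
                self.handle_mmio(addr, value)     # hole: trap, yield to the VMM

    def handle_mmio(self, addr, value):
        # The VMM emulates the device access, then the vCPU resumes (the cycle).
        self.mmio_log.append((addr, value))

vm = ToyVm(ram={0x1000: 0})
vm.vcpu_run([(0x1000, 42), (0xFEE0_0000, 7)])    # second address is a "hole"
```

The point of the sketch is only the control flow: the expensive part is every round trip through `handle_mmio`, which is what the fast paths discussed later try to avoid.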

03:37.640 --> 03:44.640
about what the differences are between KVM and the Microsoft Hypervisor — let me call it MSHV. They

03:44.640 --> 03:57.640
are both type-1 hypervisors, so they run on bare metal. But while KVM is kind of a part of the kernel,

03:57.640 --> 04:05.640
embedded in the host kernel, MSHV actually is a very thin layer. So, I don't work on this part,

04:05.640 --> 04:12.640
but people tell me it's really very thin. And the host OS, as we call it, is really just

04:12.640 --> 04:22.640
a root partition. So, it's a privileged guest — that's the main difference here. And when you

04:22.640 --> 04:29.640
look at this topology of how guests and, and hypervisors are organized, then there's also

04:29.640 --> 04:38.640
the option to run the root partition on the Microsoft Hypervisor. And in practice

04:38.640 --> 04:47.640
this looks quite like KVM to a VMM. So, there's an MSHV device that's controlled with

04:47.640 --> 04:56.640
IOCTLs and file descriptors. And this has been used for quite some time already in an

04:56.640 --> 05:05.640
AKS product called Pod Sandboxing, where Kata is used together with Cloud Hypervisor on

05:05.640 --> 05:10.640
the Azure Linux kernel. So, it wasn't part of the mainline Linux kernel yet; it was developed

05:10.640 --> 05:17.640
on a separate Azure Linux tree for quite some time. And it's using the Rust VMM integration

05:17.640 --> 05:25.640
that we just learned about. There is the option to run this on bare metal. There's the

05:25.640 --> 05:34.640
option to run it with a nested hypervisor, or direct virtualization, which I describe in a

05:34.640 --> 05:43.640
later slide. But essentially, I think the only part that will be really relevant for the

05:43.640 --> 05:50.640
user — if you want to have a pure open source stack — is probably direct virtualization,

05:50.640 --> 05:58.640
because for the nested hypervisor, you would run the Microsoft Hypervisor, as part of the

05:58.640 --> 06:04.640
Hyper-V product or something. The first parts of this have been upstream since

06:04.640 --> 06:17.640
2015, and incrementally more features are being added. The rationale behind QEMU on Linux and

06:17.640 --> 06:24.640
MSHV is essentially that there are a lot of asks from customers to sandbox untrusted code.

06:24.640 --> 06:30.640
I mean, it was obvious for functions-as-a-service, for example. It becomes a lot more

06:30.640 --> 06:36.640
critical when people actually start using those coding agents, and they would operate

06:37.640 --> 06:43.640
a lot more on your system. So there's just a lot more untrusted code that we run in

06:43.640 --> 06:52.640
those circumstances. And MSHV provides us with — the Microsoft term for it is

06:52.640 --> 06:59.640
"hostile multi-tenancy". It basically means you cannot assume anything about the neighbor, or

07:00.640 --> 07:07.640
about the tenant not doing something malicious. So this is really a hard isolation layer

07:07.640 --> 07:14.640
that is essentially also used to run Azure virtualization products. So customers

07:14.640 --> 07:20.640
will want to build those virtualization platforms, and Linux is a very popular —

07:20.640 --> 07:28.640
the most popular workload on Azure when you look at the cores — and QEMU is just the most popular

07:29.640 --> 07:35.640
and comprehensive VMM, with a lot of device models etc. So customers not only want to run cloud-

07:35.640 --> 07:42.640
native workloads, but maybe also more exotic ones. And eventually, QEMU is just also the building

07:42.640 --> 07:48.640
block for higher-level virtualization stacks like libvirt and KubeVirt. So if you want to use your

07:48.640 --> 07:54.640
custom hypervisor, you would have to build integrations for those. So this is why we ended up

07:54.640 --> 08:03.640
saying we need to have an integration in QEMU. And what I mentioned earlier is there's

08:03.640 --> 08:13.640
one option that this kind of virtualization technology gives us with the Microsoft Hypervisor:

08:13.640 --> 08:21.640
that we can have direct virtualization. I think the marketing term that was coined

08:21.640 --> 08:28.640
before, it was called hierarchical virtualization, which is kind of the opposite, but it just depends on

08:28.640 --> 08:35.640
the angle you look at it from. Because for the customer, it's direct virtualization;

08:35.640 --> 08:42.640
from the perspective of the technical stack, it's hierarchical. Because the

08:43.640 --> 08:51.640
basic idea is that if you have a root partition, then you don't actually need to

08:51.640 --> 08:57.640
use nested virtualization like this. You can instead just have a logical guest that is actually

08:57.640 --> 09:11.640
a sibling. So you inherit the resources and you're able as a customer to launch

09:11.640 --> 09:17.640
virtual machines in your virtual machine, but they're not nested. You don't need a hypervisor

09:17.640 --> 09:24.640
stack. They actually just use resources from your allocated resources and create a sibling.

09:25.640 --> 09:32.640
And this enables certain use cases. For example, device assignment, which was pretty hard

09:32.640 --> 09:43.640
before, so you can build virtualization solutions that have device assignments to your VM

09:43.640 --> 09:55.640
and then you can assign it further to your virtualized L2 guest. And it's also quite useful to

09:55.640 --> 10:04.640
avoid this kind of overhead, which is what you get if you have layers of hypervisors:

10:04.640 --> 10:14.640
then you have to go through the cycle that I showed in an earlier slide. This gets quite expensive

10:14.640 --> 10:20.640
if you have to do these kinds of round trips over and over through multiple layers. So there's

10:20.640 --> 10:26.640
a performance tax that you pay when you do nested virtualization, and the goal here is

10:27.640 --> 10:34.640
basically to avoid this. So this has been announced for some months now and it should

10:34.640 --> 10:50.640
enter public preview soon. There are several integration points in QEMU where we

10:50.640 --> 11:00.640
kind of follow the KVM example, or try to model it. I think the MSHV drivers are

11:00.640 --> 11:10.640
also written in a way that makes it quite easy for VMMs to use the same facilities as KVM.

11:10.640 --> 11:22.640
So a lot of aspects are — I wouldn't say trivial to port, but it was easy to

11:22.640 --> 11:33.640
identify where the facilities should be implemented. Like for example, we have the requirement

11:33.640 --> 11:45.640
in MSHV that we have to do MMIO instruction decoding and instruction emulation. Because

11:45.640 --> 11:53.640
in the earlier slide I showed that if you trap into the VMM and you need to emulate for example

11:53.640 --> 12:03.640
an MMIO access, then you have to run those instructions in software, essentially. And

12:03.640 --> 12:13.640
a colleague basically looked at QEMU and saw that HVF, the Apple Hypervisor framework, already

12:13.640 --> 12:19.640
has something like this in place. So we were able to generalize this concept and extend it a bit

12:19.640 --> 12:27.640
to what we need to do. And this worked quite nicely, and I don't think it was very intrusive. We were

12:27.640 --> 12:35.640
also reusing all the acceleration parts. So this is basically the cycle that we've seen earlier.
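
To give a flavor of what "MMIO instruction decoding and emulation" means, here is a deliberately tiny toy emulator for just two real x86-64 encodings that can touch MMIO: `A1` (`mov eax, [moffs]`) and `A3` (`mov [moffs], eax`), each followed by an 8-byte absolute address in 64-bit mode. A real decoder — QEMU's, or the HVF one that was generalized — handles prefixes, ModRM/SIB, operand widths, and much more; this sketch only shows the trap → fetch bytes → decode → emulate → resume shape, and all names are illustrative.

```python
import struct

def emulate_mmio_insn(code, regs, mmio_read, mmio_write):
    """Decode and emulate one of two toy mov encodings against MMIO callbacks."""
    op = code[0]
    addr = struct.unpack_from("<Q", code, 1)[0]   # 8-byte absolute address
    if op == 0xA1:                                # mov eax, [moffs]
        regs["eax"] = mmio_read(addr)
    elif op == 0xA3:                              # mov [moffs], eax
        mmio_write(addr, regs["eax"])
    else:
        raise NotImplementedError(f"opcode {op:#x}")
    return 9                                      # length: 1 opcode + 8 address

device = {}                                       # stand-in for an emulated device
regs = {"eax": 0xDEADBEEF}
n = emulate_mmio_insn(bytes([0xA3]) + struct.pack("<Q", 0xFEB0_0000),
                      regs, device.get, device.__setitem__)
```

The returned length is what the VMM would add to the guest's instruction pointer before resuming the vCPU.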

12:35.640 --> 12:45.640
For real workloads, this is not super practical, because you essentially want to avoid

12:45.640 --> 12:55.640
entering the VMM, to avoid the VM exits, and to avoid context switches, etc. And so there are

12:55.640 --> 13:05.640
a lot of optimization paths that have been developed in KVM and QEMU to move the data plane

13:05.640 --> 13:20.440
outside of the VMM. Like we see here, one typical chain of events: we see the

13:20.440 --> 13:27.240
guest driver — a virtio driver, for example — writes to some MMIO address. The CPU will

13:28.520 --> 13:35.160
detect a page table violation, as it cannot map this. We have a VM exit, and then the

13:35.640 --> 13:45.960
kernel can check whether there's maybe an eventfd registered. And if that's the case, then, without

13:45.960 --> 13:51.960
exiting to the VMM, in a different thread, in a dedicated thread, we can

13:51.960 --> 13:58.200
trigger the processing of this event in a device backend and when the device backend is done,

13:58.200 --> 14:04.440
it can do the opposite: it can write to an irqfd that is registered, again, in the host kernel, and then

14:04.600 --> 14:11.960
it can inject an interrupt that then ends up in the guest interrupt service routine and complete

14:11.960 --> 14:17.720
the operation, for example. So in this picture, the VMM is not involved anymore, and this speeds up

14:17.720 --> 14:30.440
things tremendously. Also, for VFIO, I think we are mostly able to use existing KVM abstractions

14:30.440 --> 14:40.920
and kind of call them, like, KVM-accelerated irqfds. So we probably have to do some renaming there,

14:40.920 --> 14:49.320
but essentially it works pretty well. So I still have a few minutes to talk a bit about the

14:49.320 --> 15:01.880
challenges that we're facing. So one of those challenges is VFIO. There's existing VFIO

15:01.880 --> 15:13.240
support in Cloud Hypervisor, and it kind of follows the same principle: you essentially have to

15:13.720 --> 15:26.120
set up a bridge between the VFIO device and the irqchip via eventfds, and then the interrupts

15:26.120 --> 15:35.720
should get routed from the passed-through device to the guest properly. The problem, when we tried to

15:35.720 --> 15:45.160
hook this up, was that irqfds were dropped, and it wasn't really working until the device

15:45.160 --> 15:53.240
reset itself. Then, eventually, after some debugging, it turned out that the sequence that the MSHV

15:54.200 --> 16:05.640
driver expects is just very subtly different from what KVM does. So just the order

16:05.640 --> 16:14.680
in which you create components and how you wire them up. The end result is the same.

16:14.680 --> 16:23.320
So we have to see how we address this. I think we have options for this kind of problem: we could

16:23.320 --> 16:32.760
have just an MSHV-specific path, which is maybe not desirable. You could adjust the driver behavior,

16:32.760 --> 16:38.280
maybe — I'm not super sure this works, because KVM is maybe a bit less of a distributed

16:38.280 --> 16:46.120
system than MSHV is, because we still have this hypervisor outside the kernel. We could also

16:47.640 --> 16:53.400
maybe ask whether the QEMU behavior could be adjusted. So I just experimented around and moved the

16:53.400 --> 17:01.800
sequence around, and things work for both KVM and MSHV. So this is something that we have to address.
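
The failure mode just described — events silently dropped because of setup ordering — can be illustrated with a toy model. This is only an illustration of why ordering matters, not of the actual MSHV or KVM driver semantics; the class and names are made up.

```python
class ToyIrqchip:
    """Toy interrupt routing: events to a GSI with no route yet are lost."""

    def __init__(self):
        self.routes = {}          # gsi -> list of delivered events
        self.dropped = 0

    def add_route(self, gsi):
        self.routes[gsi] = []

    def signal(self, gsi):
        if gsi in self.routes:
            self.routes[gsi].append("irq")
        else:
            self.dropped += 1     # no route yet: the event just vanishes

# Wrong order: the device fires before the guest-facing route is wired up.
bad = ToyIrqchip()
bad.signal(gsi=5)
bad.add_route(gsi=5)

# Right order: create the route first, then let the device signal.
good = ToyIrqchip()
good.add_route(gsi=5)
good.signal(gsi=5)
```

With identical calls in a different order, the "bad" sequence loses the interrupt entirely — the same end state of routes, but different observable behavior, which is exactly the kind of subtle divergence between the two drivers.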

17:03.000 --> 17:14.600
Another challenge is migration, or even live migration. So this is not initially obvious when you

17:14.600 --> 17:23.240
develop against MSHV, because you issue imperative commands to an API, the VM gets created, and a lot

17:23.240 --> 17:29.320
of magic happens in the background. When you want to take a VM — or basically create a saved state

17:29.320 --> 17:36.920
of a VM and kind of reproduce it somewhere else — then you start to notice that there's a lot of

17:36.920 --> 17:47.160
implicit state, even in x86 guests. And while you initially don't really have to care about a lot of

17:47.160 --> 17:54.440
it — like, I don't know, AVX-512 registers, etc. Those kinds of things don't play a role, for

17:54.520 --> 18:00.600
example, in I/O emulation. But they definitely play a role when you want to migrate:

18:00.600 --> 18:08.040
the kernel will panic if, for example, the FPU registers are not really in the format

18:10.520 --> 18:17.800
they were in before the context switch. And we then need to figure out how we

18:17.800 --> 18:28.920
capture, serialize, and round-trip this whole state. So this has been quite challenging, because

18:28.920 --> 18:36.200
also, QEMU has not just a blob serialization format — we actually want to destructure, for example,

18:36.200 --> 18:43.480
the FPU registers that are stored in the XSAVE area, or MSRs. And there's an additional problem with

18:44.440 --> 18:53.160
paravirtualized devices, like the synthetic interrupt controller that is used by the Microsoft Hypervisor

18:53.160 --> 18:59.320
and also by Linux guests. So there's a long tail of subtle, hard-to-reproduce issues, and I think
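
The capture/serialize/round-trip requirement can be sketched with a drastically reduced, hypothetical vCPU state — real state also includes the XSAVE area, many MSRs, the SynIC pages, and so on; the field names and values here are made up. The property that matters is the one being tested: restoring a saved state must reproduce it bit-for-bit, or the guest kernel will notice.

```python
import struct

# Hypothetical minimal vCPU state: rip, rsp, cr3, and one stand-in MSR,
# each a 64-bit value, packed little-endian.
FMT = "<QQQQ"

def save(state):
    """Serialize the toy vCPU state to a flat byte blob."""
    return struct.pack(FMT, state["rip"], state["rsp"],
                       state["cr3"], state["msr_example"])

def restore(blob):
    """Rebuild the toy vCPU state from the blob — must be lossless."""
    rip, rsp, cr3, msr = struct.unpack(FMT, blob)
    return {"rip": rip, "rsp": rsp, "cr3": cr3, "msr_example": msr}

vcpu = {"rip": 0xFFFF_8000_0001_0000, "rsp": 0x7FFF_FFFF_E000,
        "cr3": 0x1AB000, "msr_example": 0xC000_0080}
```

In practice, the hard part is not the packing but enumerating every piece of implicit state the hypervisor holds so that it ends up inside `save` at all.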

18:59.320 --> 19:05.160
it will be hard to tell when you're done with this. So the challenge here really is to define

19:05.640 --> 19:14.600
like a bar that you meet and then you can advance. This is like one of the challenges we have at

19:14.600 --> 19:22.840
the moment. And quickly, just what's upcoming: what we didn't have for a while was proper

19:23.480 --> 19:31.560
CPU model support. So in QEMU you can tweak the CPU features extensively, and this wasn't really

19:31.640 --> 19:39.320
properly reflected in the guests so far, so this is something that we fixed. Device passthrough

19:39.320 --> 19:46.520
will come, then live migration, and also colleagues started to work on Arm. And I think

19:48.360 --> 19:50.360
that would be it from my side.

19:57.640 --> 19:59.640
Questions?

20:02.520 --> 20:13.800
You quickly compared to KVM in the beginning, and there's another accelerator

20:14.040 --> 20:22.760
you call WHPX, I think. How does that one compare, the one for Windows?

20:22.760 --> 20:33.560
Yeah, so that's the question about WHPX. It's more like HVF, I would say, because

20:33.560 --> 20:41.160
it's using a high-level API to create VMs under Windows. This one, MSHV, is a Linux

20:41.240 --> 20:51.480
driver, so the root partition is on Linux — this is the difference. And WHPX, I think,

20:51.480 --> 20:58.040
can't do live migration and certain other things that you can do with MSHV; it's a bit more

20:58.280 --> 21:00.280
high level.

21:09.480 --> 21:18.680
Not so far, because, yeah, WHPX is essentially a Windows technology, and the

21:19.000 --> 21:27.640
host, in quotes, is Linux. And the workloads we are expecting are mostly virtio, and

21:29.000 --> 21:36.760
guests — Windows guests, for example — would install virtio drivers. But yeah, maybe in the

21:36.760 --> 21:46.040
future this would be extended. So, there's one more question.

21:48.680 --> 22:04.120
Yeah, so for the hard-to-reproduce bugs, whether it would help to snapshot the machine — I mean,

22:04.120 --> 22:10.600
I have this on a slide, quickly. So what I ended up with is essentially writing my

22:10.600 --> 22:18.360
own debug workloads, which are — like, there's no OS — bare-metal x86

22:19.240 --> 22:24.360
programs that start very simple, and then you start adding features, MSRs, etc., and then you see

22:24.360 --> 22:33.480
what breaks. And this has been very useful. So I've been using a Rust library with

22:34.360 --> 22:40.200
which you can create this stuff — like serial console programs on bare metal — easily. It

22:40.200 --> 22:49.640
was one of the more fun aspects of this work. Okay, I don't see any other questions,

22:51.160 --> 22:53.640
so I think we're good. Thanks a lot for attending.


