WEBVTT

00:00.000 --> 00:08.000
Thank you very much.

00:08.000 --> 00:11.000
Welcome to my talk.

00:11.000 --> 00:16.000
The agenda for the talk today is: for those of you who don't know NOVA,

00:16.000 --> 00:23.000
I'll give a very brief architecture overview and explain what NOVA is and what runs on top of it.

00:23.000 --> 00:28.000
Then I'll talk a little bit about the formal verification, give an update on the current status,

00:28.000 --> 00:33.000
since NOVA is one of those microkernels that is undergoing formal verification.

00:33.000 --> 00:38.000
And then we get to the main topic of the talk which is scalability.

00:38.000 --> 00:42.000
You just heard from Michael that scalability is an important topic.

00:42.000 --> 00:47.000
And in NOVA there's a lot of work going on improving the scalability.

00:47.000 --> 00:53.000
And today we will focus primarily on supporting thousands of devices,

00:53.000 --> 01:01.000
that is, thousands of interrupts, on big-iron server machines, and the new microkernel APIs that this requires.

01:01.000 --> 01:05.000
And hopefully we'll have some time for questions at the end.

01:05.000 --> 01:11.000
So the architecture overview of NOVA is shown on this slide.

01:11.000 --> 01:16.000
So you have the microkernel running at the highest privilege level.

01:16.000 --> 01:20.000
In our case it's called the microhypervisor, because it's not just a microkernel.

01:20.000 --> 01:22.000
It is also a hypervisor.

01:22.000 --> 01:27.000
And it's roughly 15,000 to 20,000 lines of code, depending on how you count.

01:27.000 --> 01:35.000
And then on top of it you have a componentized multi-server operating system, shown as all these colorful boxes,

01:35.000 --> 01:42.000
and every box represents a component that runs isolated in its own protection domain, or its own address space.

01:42.000 --> 01:46.000
And NOVA is a microhypervisor because some of these components,

01:46.000 --> 01:51.000
shown as these yellow boxes at the top, are actually user-mode virtual machine monitors.

01:51.000 --> 01:55.000
And there's an instance of the VMM for every virtual machine.

01:55.000 --> 02:01.000
And you can run unmodified guest operating systems, unikernels,

02:01.000 --> 02:06.000
container runtimes, OSes, whatever you want inside those virtual machines.

02:06.000 --> 02:11.000
And the whole stack is undergoing formal verification.

02:11.000 --> 02:19.000
And for those of you who are very interested in the topic of writing specifications for each of the functions

02:19.000 --> 02:24.000
and mathematically proving those, there was a talk at the microkernel devroom last year,

02:24.000 --> 02:31.000
given by my colleague Hai Dang, who talked in great depth about the formal verification that we do in NOVA.

02:31.000 --> 02:35.000
So for those of you who are interested, I suggest you check that out.

02:35.000 --> 02:46.000
NOVA is written in C++20, so formal verification attaches a specification to every C++ function.

02:46.000 --> 02:56.000
And then there are proofs for those functions that mathematically verify that each function conforms to its specification.

02:56.000 --> 03:04.000
And in NOVA's case, this is actually a very complex endeavor, because NOVA is a highly concurrent microkernel.

03:04.000 --> 03:12.000
And when you have functions accessing the same state concurrently, the state space can explode.

03:12.000 --> 03:16.000
But we are using separation logic, which makes this problem tractable.

03:16.000 --> 03:23.000
And the approach that we are taking is actually cutting in at the system-call, or hypercall, layer.

03:23.000 --> 03:25.000
And we verify from there, top-down.

03:25.000 --> 03:33.000
So you can see this table, which is meant to provide an update on the progress that we're making.

03:33.000 --> 03:39.000
On the left side, you have a list of all the hypercalls in NOVA,

03:39.000 --> 03:45.000
and which of those already have a full specification.

03:45.000 --> 03:57.000
And then we're drilling down into the stack, until at the very right side we arrive at capabilities, memory allocators, timers, kernel memory, and so forth.

03:57.000 --> 04:06.000
It's an ongoing effort. And there are a lot of links to the NOVA specifications and the proofs.

04:06.000 --> 04:13.000
And one of the things that happened in the last year is that we actually made those specs and proofs open source under GPLv2.

04:13.000 --> 04:19.000
So you can go to the top link there and you'll find all of that.

04:19.000 --> 04:25.000
The next step that we're taking in formal verification is that we actually look at weak-memory reasoning:

04:25.000 --> 04:32.000
reasoning about, when one core does a store, what the point in time is when all other cores have seen that store.

04:32.000 --> 04:40.000
And also the interaction with the hardware, that is, the page-table walkers in the MMU and the TLBs; that requires a lot of complex modeling.

04:40.000 --> 04:46.000
But I don't want to go into the details here, because the subject of the talk is not formal verification.

04:46.000 --> 04:50.000
The subject of the talk is scalability.

04:50.000 --> 04:57.000
NOVA currently runs on the two major architectures. It runs on ARMv8, called AArch64.

04:57.000 --> 05:07.000
And it runs on the x86 64-bit architecture. And it scales all the way up from the smallest embedded devices.

05:07.000 --> 05:13.000
So in the case of ARM, I've just listed the Raspberry Pi as an example of a smaller embedded device.

05:13.000 --> 05:17.000
All the way up to the largest servers that you can find in cloud providers.

05:17.000 --> 05:30.000
So the largest we've currently tested is an AWS Graviton 4, which has 192 cores, lots of memory, lots of SMMUs, lots of interrupt controllers.

05:30.000 --> 05:35.000
And it actually has 104 PCI devices.

05:35.000 --> 05:44.000
The same on x86: we have one binary that you can build that runs all the way from the smallest x86 machines.

05:44.000 --> 05:49.000
So we like to use Intel NUCs in these small form factors,

05:49.000 --> 05:56.000
all the way up to the latest Granite Rapids Xeon 6 servers, also with 192 threads, lots of RAM,

05:56.000 --> 06:07.000
450-something PCI devices, 10 IOMMUs, 192 CPUs, five-level page tables, so really the big-iron boxes.

06:08.000 --> 06:12.000
And when you compile NOVA, it's very simple.

06:12.000 --> 06:21.000
Assuming you have a cross compiler for the architecture that you care about, or you compile it natively, you type make ARCH= whatever your architecture is.

06:21.000 --> 06:26.000
And in the ARM case, you also have to specify that you want the Raspberry Pi as the board.

06:26.000 --> 06:32.000
And then you get a binary which is roughly 100 kilobytes in size.

06:32.000 --> 06:36.000
So really a microkernel, not some big bloated piece of software.

06:36.000 --> 06:42.000
And the same binary works across the entire architecture stack from small to large.

06:42.000 --> 06:52.000
What I actually want you to take away from this slide is that the machines that we run on are now so large that they really have hundreds of devices.

06:52.000 --> 07:01.000
And if you assume that every device just needs one interrupt, which is not true; many devices nowadays need tens of interrupts, tens of message-signaled interrupts.

07:01.000 --> 07:12.000
Then the number of interrupts that traditional microkernels have provided in the past is no longer sufficient. You really need to have thousands of them.

07:12.000 --> 07:15.000
So what do we need to do?

07:15.000 --> 07:23.000
And you just heard from Michael that we gain a lot of isolation and a lot of robustness from running device drivers in user mode.

07:23.000 --> 07:28.000
And I want to take a quick look at what it takes to run a device driver in user mode.

07:28.000 --> 07:32.000
And what infrastructure a microkernel has to provide.

07:32.000 --> 07:41.000
So first of all, the code and the data of the device driver, you can put that in a user mode protection domain just like an application.

07:41.000 --> 07:46.000
But that application that device driver application needs access to the device registers.

07:46.000 --> 07:53.000
And on modern platforms, this is memory-mapped I/O; on some older Intel machines, you also sometimes have port-based I/O.

07:53.000 --> 08:01.000
And this is simple to achieve. You take the MMU, you map the memory-mapped registers of the device into the address space of the application.

08:01.000 --> 08:04.000
And then you can drive them with normal load and store instructions.

08:04.000 --> 08:07.000
So that functionality is trivial to implement.

08:07.000 --> 08:09.000
But then it gets a little bit harder.

08:10.000 --> 08:15.000
You need to make sure that this device driver can actually receive interrupts.

08:15.000 --> 08:19.000
So you need to convert the hardware interrupts into some sort of notification mechanism.

08:19.000 --> 08:24.000
And when the device driver is done doing the interrupt handler work, the interrupt needs to be acknowledged.

08:24.000 --> 08:32.000
So you need to have some handshake protocol with the kernel, because you cannot give a device driver access to the interrupt controller for security reasons.

08:32.000 --> 08:36.000
That needs to be under the control of the microkernel.

08:36.000 --> 08:43.000
So the way NOVA solves this is that it transforms the occurrence of an interrupt into an up operation on a semaphore.

08:43.000 --> 08:49.000
And when the device driver is done, it downs the semaphore, which automatically acknowledges the interrupt.

08:49.000 --> 08:55.000
And that's a very nice, very flexible, cross-core, and cheap notification mechanism.

08:55.000 --> 08:57.000
So that's the interrupt problem.
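The interrupt-to-semaphore handshake just described can be sketched as follows. This is an illustrative single-threaded model, not the real NOVA kernel code: the type and method names (IrqSemaphore, on_hw_interrupt, down) are hypothetical, and a real implementation would block instead of returning false.

```cpp
#include <atomic>

// Illustrative sketch: the kernel converts a hardware interrupt into an
// "up" on a counting semaphore and masks the line; the user-mode driver
// "downs" the semaphore, which acknowledges (unmasks) the interrupt.
struct IrqSemaphore {
    std::atomic<unsigned> count{0};
    bool masked{false};

    void on_hw_interrupt() {        // called from the kernel IRQ path
        masked = true;              // keep the line quiet until handled
        count.fetch_add(1, std::memory_order_release);
    }

    bool down() {                   // called by the user-mode driver
        unsigned c = count.load(std::memory_order_acquire);
        if (c == 0)
            return false;           // real code would block here instead
        count.fetch_sub(1, std::memory_order_acq_rel);
        masked = false;             // acknowledge/unmask as a side effect
        return true;
    }
};
```

The point of the design is that the driver never touches the interrupt controller: acknowledgment happens as a side effect of the semaphore down, so the controller stays under kernel control.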

08:57.000 --> 09:04.000
What we also need to do is control the access of device drivers with regard to DMA.

09:04.000 --> 09:08.000
You cannot have a device driver program an arbitrary memory address,

09:08.000 --> 09:11.000
DMA to that address and scribble all over the kernel memory.

09:11.000 --> 09:16.000
So you need to have some access control mechanism, some hardware enforced access control,

09:16.000 --> 09:22.000
where the use of DMA, or the addressable range for DMA, is restricted,

09:22.000 --> 09:28.000
such that the device drivers can only DMA to memory or from memory that belongs to them.

09:28.000 --> 09:35.000
And this problem has been solved in many architectures with the concept of an IOMMU or system MMU,

09:35.000 --> 09:40.000
where you can have a translation from DMA addresses to host physical addresses.

09:40.000 --> 09:46.000
So this is DMA remapping, but you also need to design interfaces for that, and we'll talk about that.

09:46.000 --> 09:53.000
And in NOVA's case, there's a hypercall called assign_dev, where you can assign a device to a DMA space.

09:53.000 --> 10:02.000
And the DMA space is effectively a page table that represents an address space that you can assign to a device to restrict

10:02.000 --> 10:06.000
what region of memory it can DMA to or from.
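A DMA space can be modeled very simply: device-visible addresses are translated page by page through a table the kernel controls, and anything unmapped faults instead of reaching memory. This is a toy sketch with made-up names (DmaSpace, translate); the real mechanism is an IOMMU/SMMU page table walked by hardware, not a std::map.

```cpp
#include <cstdint>
#include <map>
#include <optional>

constexpr std::uint64_t PAGE = 4096;

// Toy model of the access control a DMA space provides: the kernel maps
// device virtual addresses (DVAs) to host physical addresses (HPAs);
// any DVA the driver's domain never mapped has no translation and faults.
struct DmaSpace {
    std::map<std::uint64_t, std::uint64_t> pt;   // device page -> host page

    void map(std::uint64_t dva, std::uint64_t hpa) {
        pt[dva / PAGE] = hpa / PAGE;
    }

    std::optional<std::uint64_t> translate(std::uint64_t dva) const {
        auto it = pt.find(dva / PAGE);
        if (it == pt.end())
            return std::nullopt;                 // not mapped: DMA fault
        return it->second * PAGE + dva % PAGE;   // keep the page offset
    }
};
```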

10:06.000 --> 10:14.000
And you also need to provide access to interrupts because not every device or every device driver should be able to trigger any arbitrary interrupt.

10:14.000 --> 10:21.000
You want to say this device can trigger this particular interrupt or this particular set of interrupts,

10:21.000 --> 10:29.000
but no more: if you are the network driver, you shouldn't be able to assert an interrupt that belongs to a storage driver or the GPU.

10:29.000 --> 10:38.000
So you need an API that takes a global system interrupt and routes it, not just to a semaphore, an interrupt semaphore,

10:38.000 --> 10:45.000
but also to a particular CPU because not all of these thousands of interrupts should hit the same CPU core,

10:45.000 --> 10:49.000
you actually want to spread them out across many cores.

10:49.000 --> 11:01.000
And the hypercall that does this is assign_int, where you can say: assign this particular hardware interrupt of this device, route it to this particular CPU, and configure it.

11:01.000 --> 11:10.000
So I want to explain a little bit how the DMA translation part works, because not many people know all the details,

11:10.000 --> 11:17.000
and I'll use the ARM SMMUv3 as an example. So in the ARM SMMUv3, starting at the top-left corner,

11:17.000 --> 11:23.000
the DMA stream that comes from a device is identified by a stream ID.

11:23.000 --> 11:32.000
And inside the SMMU, there's a multi-level stream table, which is set up by the NOVA microkernel, or microhypervisor,

11:32.000 --> 11:44.000
and every device has a stream table entry that points to a DMA page table, which controls what regions of memory that device can target.

11:44.000 --> 11:52.000
And then the SMMU also has a command queue, where NOVA is the producer, so it puts commands into the queue,

11:52.000 --> 11:59.000
and the SMMU is the consumer; it reads commands off the queue. And there's an event queue, which works in the opposite direction,

11:59.000 --> 12:06.000
where the SMMU puts events in and NOVA reads events out; so, for example, the SMMU could say:

12:06.000 --> 12:13.000
there was a DMA remapping error, where some device tried to access some region of memory that it couldn't access.

12:13.000 --> 12:23.000
The problem with the SMMUv3 is that it uses stream table entries, which are 512 bits in size.

12:23.000 --> 12:28.000
So if you do the math, that is 8 64-bit words.

12:28.000 --> 12:34.000
You do not have a processor instruction that sets up a stream table entry atomically.

12:34.000 --> 12:42.000
You have to write these 64 bytes in a particular order, and you're effectively racing with the SMMU on those writes.

12:42.000 --> 12:49.000
And the spec specifically says that the SMMU may read all those 8 words in any arbitrary order.

12:49.000 --> 13:00.000
The question is: how do you actually set up a stream table entry in such a way that the SMMU always sees a consistent state and never some partial or corrupted state?

13:00.000 --> 13:05.000
And you need to design a protocol for that, and the protocol works as follows.

13:05.000 --> 13:17.000
So we start in the invalid state, where the valid bit of the stream table entry is zero, which means whatever configuration the SMMU has fetched will not be considered valid,

13:17.000 --> 13:19.000
and it won't act upon it.

13:19.000 --> 13:26.000
And the moment that the valid bit becomes one, whatever configuration it has set will be used.

13:26.000 --> 13:36.000
So it's clear that we actually need to set the valid bit last, because that's what ultimately commits the entry.

13:36.000 --> 13:44.000
So we are not just racing against the SMMU, we are also racing against the same call being made from other CPUs,

13:44.000 --> 13:50.000
which may also try to configure a stream table entry. So we need to arbitrate to win this race.

13:50.000 --> 13:57.000
So the first thing we do is we atomically set the configuration in the stream table entry to something other than zero, with an atomic operation,

13:57.000 --> 14:03.000
and whoever wins that race can proceed and everybody else will get an aborted error code.

14:03.000 --> 14:09.000
And then, once we've set up the configuration, the valid bit is still clear, so the SMMU won't use the entry.

14:09.000 --> 14:13.000
We have to make that configuration observable.

14:13.000 --> 14:22.000
And that means, in the ARM case, that we need to make sure that the stream-table walker in the SMMU can see it; if the SMMU is not coherent,

14:22.000 --> 14:29.000
we push those stores all the way out to memory, out of the cache hierarchy, so that the SMMU can actually see them.

14:29.000 --> 14:35.000
And then, because the SMMU may have already read some of those configuration words,

14:35.000 --> 14:38.000
we have to tell the SMMU to forget this entry.

14:38.000 --> 14:44.000
So this means: we've written the valid configuration, we tell it to forget whatever the invalid configuration was,

14:44.000 --> 14:48.000
and then we set the stream table entry's valid bit to one.

14:48.000 --> 14:56.000
We make that also observable, but the SMMU may still have cached valid equals zero, so we have to invalidate the STE again.

14:56.000 --> 15:03.000
So you see, this is a very complicated dance you have to do to make the SMMU actually see all the changes in the right order.
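The publish protocol just walked through can be sketched in code. This is a simplified model, not the real SMMUv3 register layout: the STE is modeled as 8 atomic 64-bit words, word 0 carries illustrative valid and config fields, the fence stands in for a DSB, and cfgi_ste is a stub for pushing a CFGI_STE command to the SMMU command queue.

```cpp
#include <atomic>
#include <cstdint>

// Simplified STE layout (illustrative, not the real spec encoding).
constexpr std::uint64_t STE_V   = 1ull << 0;     // valid bit, committed last
constexpr std::uint64_t STE_CFG = 0b111ull << 1; // non-zero config claims entry

struct Ste { std::atomic<std::uint64_t> word[8]; };

// Stub: in real code this pushes CMD_CFGI_STE to the SMMU command queue
// so the SMMU forgets any stale copy of the entry it may have fetched.
void cfgi_ste(Ste&) {}

bool publish(Ste& ste, std::uint64_t cfg, const std::uint64_t payload[7])
{
    // 1. Atomically claim the entry: whoever flips config from 0 wins;
    //    losers get an "aborted" result, mirroring racing CPUs.
    std::uint64_t expected = 0;
    if (!ste.word[0].compare_exchange_strong(expected, cfg & STE_CFG))
        return false;

    // 2. Write the remaining words while V is still 0 (entry not live).
    for (int i = 0; i < 7; i++)
        ste.word[i + 1].store(payload[i], std::memory_order_relaxed);

    // 3. Make the stores observable to the (possibly non-coherent) SMMU,
    //    then tell it to forget whatever it may have cached.
    std::atomic_thread_fence(std::memory_order_release);  // stand-in for DSB
    cfgi_ste(ste);

    // 4. Commit: set V last, then invalidate once more, because the
    //    SMMU may still have cached V=0.
    ste.word[0].fetch_or(STE_V, std::memory_order_release);
    cfgi_ste(ste);
    return true;
}
```

The compare-exchange handles the CPU-vs-CPU race, while the fence-plus-invalidate pairs handle the CPU-vs-SMMU race, which no software lock can cover.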

15:03.000 --> 15:10.000
And likewise, if you want to remove an entry, you have to atomically clear the valid bit,

15:10.000 --> 15:16.000
and whoever clears the valid bit first wins the race to unconfigure the entry.

15:16.000 --> 15:25.000
And then you can undo all the remaining configuration, but you also have to invalidate the entry in the SMMU,

15:25.000 --> 15:32.000
and invalidate the TLB, which means: forget any translation entries from those DMA page tables that were set up using this entry.

15:32.000 --> 15:39.000
So this is just an example of the complicated synchronization you have to do in a highly concurrent microkernel,

15:39.000 --> 15:45.000
where you're not just racing among the 192 CPUs, but you're also racing with the hardware.

15:45.000 --> 15:51.000
The bad thing is that, whatever locking scheme you may have in mind,

15:51.000 --> 15:54.000
the hardware doesn't know anything about the locks in your software.

15:54.000 --> 15:59.000
So you have to be atomic between the software and the hardware.

15:59.000 --> 16:06.000
But the moment you get that right, you get the synchronization between the cores for free.

16:06.000 --> 16:13.000
So let's look at the interrupt system on ARM, because I said we want to scale to thousands of interrupts,

16:13.000 --> 16:18.000
and we need to understand what types of interrupts exist and how to configure them.

16:18.000 --> 16:26.000
And in the old ARM world, let's say before GICv3 was introduced,

16:27.000 --> 16:31.000
the interrupt subsystem was split into several components.

16:31.000 --> 16:36.000
Starting at the top-left corner, we have a GIC distributor, which has many incoming lines.

16:36.000 --> 16:40.000
So these are wired interrupts, which ARM calls shared peripheral interrupts.

16:40.000 --> 16:44.000
There's an extension called extended shared peripheral interrupts,

16:44.000 --> 16:49.000
but you can think of all of these being wires that go into this global interrupt controller.

16:49.000 --> 16:55.000
And the interrupt controller is set up to convert the assertion of any of those interrupt lines into an INTID,

16:55.000 --> 16:59.000
and route it to one of the redistributors.

16:59.000 --> 17:03.000
So there's one per core; in our case, we have just two cores, because I couldn't fit more on the slide.

17:03.000 --> 17:08.000
You have Core A and Core B. You decide whether you want to route it this way or that way.

17:08.000 --> 17:13.000
And that gets sent as a message, so a dashed line here denotes a message being sent.

17:13.000 --> 17:16.000
That's no longer a signal; that's a memory write.

17:16.000 --> 17:21.000
And then the redistributor forwards that to the CPU core,

17:21.000 --> 17:26.000
to its CPU interface, where the interrupt will be delivered to software.

17:26.000 --> 17:33.000
The next question is: how do we deliver that interrupt to a user-mode device driver?

17:33.000 --> 17:39.000
And in the ARM case, all the interrupts have a global interrupt ID, which is a number,

17:39.000 --> 17:44.000
ranging from zero all the way up to 2 to the power of 16.

17:44.000 --> 17:53.000
And we have a semaphore table, with up to 65,536 entries,

17:53.000 --> 17:59.000
where you can use the hypercall API to attach interrupt semaphores.

17:59.000 --> 18:04.000
So if you are a device driver and you have access to, say, interrupt 87,

18:04.000 --> 18:10.000
you can plug the interrupt semaphore for that interrupt into the semaphore table.

18:10.000 --> 18:15.000
And whenever that interrupt is asserted, NOVA will take a look at the semaphore table

18:15.000 --> 18:21.000
to see which interrupt semaphore is attached to it, and will up that interrupt semaphore.

18:21.000 --> 18:28.000
So starting with GICv3, there was a new component added, which is shown in the bottom-left corner.

18:28.000 --> 18:31.000
And that component is called the interrupt translation service.

18:31.000 --> 18:38.000
And the purpose of that is to scale the number of interrupts much more than wires ever could.

18:38.000 --> 18:47.000
So as I said, we can have up to 2 to the power of 16 interrupts here, but you couldn't possibly put that many wires on a small SoC.

18:47.000 --> 18:58.000
So the new concept for interrupts that allows much greater scalability is called LPI, locality-specific peripheral interrupts, which are message-signaled.

18:58.000 --> 19:07.000
So rather than a device asserting an interrupt line, the device sends a message to a specific register

19:07.000 --> 19:10.000
in the GIC ITS called the translate register.

19:10.000 --> 19:15.000
And then the GIC ITS sees that memory write and decodes it through many tables.

19:15.000 --> 19:19.000
So there's a device table and an interrupt translation table and a collection table.

19:19.000 --> 19:21.000
We look into those in a moment.

19:21.000 --> 19:25.000
It figures out where this message-signaled interrupt needs to go,

19:25.000 --> 19:34.000
and then again converts it to an INTID, and it ripples through the hierarchy the same way as a pin-based interrupt would.

19:35.000 --> 19:38.000
On X86, it is a bit different.

19:38.000 --> 19:48.000
And the goal here is for NOVA to provide the same API or as much of a uniform API to device drivers as we can.

19:48.000 --> 19:56.000
And to abstract the differences between ARM and x86 as much as possible. But we have to look at how interrupts work on x86.

19:56.000 --> 20:08.000
So we also have a wired interrupt controller, which in this case is called an IOAPIC, where pin-based interrupts are asserted via real electrical lines.

20:08.000 --> 20:13.000
And then an interrupt message is sent to the CPU cores.

20:13.000 --> 20:19.000
But in order to make that secure, there is interrupt remapping in the IOMMU that sits in the middle,

20:19.000 --> 20:28.000
as the MSI traffic flows from the IOAPIC to the CPU's local APIC or x2APIC.

20:28.000 --> 20:36.000
And the format of message-signaled interrupts on the left side of the IOMMU, the inbound side, is the remappable format.

20:36.000 --> 20:45.000
So every interrupt carries a source identifier, which in the x86 case is not a stream ID or a device ID; it is a bus/device/function.

20:45.000 --> 20:52.000
So it says: it's this device on this PCI bus, with this device number and this function number.

20:52.000 --> 20:57.000
And it carries an interrupt number, which is the slot number in the interrupt remapping table.

20:57.000 --> 21:01.000
And it again has 2 to the power of 16 entries.

21:01.000 --> 21:06.000
This interrupt remapping table then translates the interrupt to a vector.

21:06.000 --> 21:13.000
Because x86, for backward compatibility reasons, does not use global INTIDs like ARMv8 does.

21:13.000 --> 21:21.000
It actually has an IDT, an interrupt descriptor table, with only 256 entries on every core.

21:21.000 --> 21:26.000
And out of those 256 entries, 32 are already consumed by exceptions.

21:26.000 --> 21:32.000
And some are consumed by NOVA, which leaves approximately 200 entries that you can use for device interrupts.

21:32.000 --> 21:42.000
So the goal of the interrupt remapping unit is to translate the incoming MSI in remappable format into an outgoing MSI in compatibility format.

21:42.000 --> 21:57.000
And the compatibility format has a destination ID, which is the APIC ID of the CPU where the interrupt is supposed to go, and a vector, which will then be used to index into the interrupt vector table on that particular CPU.

21:57.000 --> 22:08.000
So the difference between x86 and ARM is that on ARM you have a global table, whereas on x86 you have a purely core-local table, and we somehow need to deal with this.

22:08.000 --> 22:18.000
Otherwise it's the same: as soon as an interrupt arrives, it will be translated through the semaphore table of that core and will be signaled on the corresponding interrupt semaphore.

22:19.000 --> 22:25.000
So to sum up the key differences: in both architectures you have global pin-based interrupts.

22:25.000 --> 22:30.000
You have some per CPU local interrupts that are not exposed to device drivers.

22:30.000 --> 22:35.000
They are used by NOVA internally, like timer interrupts, performance counter interrupts, and so forth.

22:35.000 --> 22:44.000
And you have message-signaled interrupts, which in the case of ARM are translated through tables in the GIC ITS, the interrupt translation service.

22:44.000 --> 23:00.000
And they assert a global INTID. In the x86 case, you have message-signaled interrupts that are translated through one interrupt remapping table in the IOMMU, and they assert a per-CPU vector.

23:00.000 --> 23:10.000
So the table on ARM is indexed by the INTID, and NOVA checks that the incoming INTID actually is within the range of the table,

23:10.000 --> 23:16.000
and that you can only plug an interrupt semaphore into the slot belonging to that interrupt semaphore.

23:16.000 --> 23:19.000
So the interrupt number must match the slot number.

23:19.000 --> 23:27.000
And the scalability limit is 2 to the power of 16 GSIs, because that's the capacity that you can have in the table.

23:27.000 --> 23:33.000
The practical limit is actually how many bits of LPI INTID the system implements.

23:33.000 --> 23:39.000
In the X86 case, you only have these CPU local tables with about 200 entries.

23:39.000 --> 23:46.000
So NOVA checks if the vector fits within the range of this IDT vector table.

23:46.000 --> 23:54.000
But you have a free choice to say I want to map this global interrupt on this particular CPU to this vector or that vector.

23:54.000 --> 24:00.000
So the scalability limit is 2 to the power of 16 GSIs, but per segment group.

24:00.000 --> 24:07.000
If you have multiple PCI segment groups, you get two times or three times or four times 2 to the power of 16 GSIs.

24:07.000 --> 24:11.000
The practical limit though is 200 interrupts per CPU.

24:11.000 --> 24:16.000
So if you have just four CPUs, you're practically limited to 800 interrupts.

24:16.000 --> 24:20.000
But if you have 192 CPUs, the limit is much higher.

24:20.000 --> 24:25.000
So that's just an artifact of X86.
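The x86 vector arithmetic above is easy to check. The split of reserved entries is partly an assumption here: 32 vectors are architecturally reserved for exceptions, and the number NOVA itself consumes is assumed to be 24 purely so the talk's "approximately 200" per CPU comes out exactly.

```cpp
// Back-of-the-envelope check of the per-CPU vector budget on x86:
// 256-entry IDT, minus 32 exception vectors, minus kernel-reserved
// vectors (24 is an illustrative assumption), times the CPU count.
constexpr int idt_entries    = 256;
constexpr int exception_vecs = 32;
constexpr int kernel_vecs    = 24;   // assumed, to match "approximately 200"
constexpr int usable_per_cpu = idt_entries - exception_vecs - kernel_vecs;

constexpr int device_interrupt_limit(int cpus) {
    return usable_per_cpu * cpus;    // practical system-wide limit
}
```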

24:25.000 --> 24:34.000
I want to talk a little bit about the GIC interrupt translation service, and what pain an operating-system person has to go through,

24:34.000 --> 24:40.000
because hardware manufacturers try to be very flexible and then decide not to be.

24:40.000 --> 24:47.000
So in the ARM case, a device sends an interrupt with two identifiers.

24:47.000 --> 24:53.000
It has a device ID, which is usually communicated through some sideband signals,

24:53.000 --> 25:02.000
and allows the ITS to tell different devices apart: it's this particular device or that particular device.

25:02.000 --> 25:11.000
And it has an event ID, which is a device-local interrupt number that gets translated to the global INTID.

25:11.000 --> 25:20.000
And the device table and the collection table are both pointed to by what ARM calls base registers.

25:20.000 --> 25:26.000
And inside an ITS, you have base registers from zero to seven.

25:26.000 --> 25:31.000
But it's not architecturally specified which base register maps what type of table.

25:31.000 --> 25:34.000
So software has to figure that out and remember it.

25:34.000 --> 25:38.000
And they only defined this in GICv4.

25:38.000 --> 25:43.000
But since NOVA is backward compatible to GICv3 as well,

25:43.000 --> 25:45.000
we have to discover this and remember it.

25:45.000 --> 25:49.000
So this is just unnecessary inflexibility.

25:50.000 --> 25:56.000
So an interrupt arrives; the ITS uses the device ID to look up an entry in the device table.

25:56.000 --> 26:00.000
And the device table points to an interrupt translation table.

26:00.000 --> 26:06.000
The event ID is used as an index into the interrupt translation table.

26:06.000 --> 26:11.000
And inside there, the ITS finds the final INTID.

26:11.000 --> 26:18.000
And the interrupt translation table entry also carries a collection ID, which says which interrupt collection.

26:18.000 --> 26:21.000
So you can group interrupts into collections.

26:21.000 --> 26:25.000
That's what the interrupt belongs to, and we don't use this in NOVA too much.

26:25.000 --> 26:29.000
We effectively say every CPU has a collection.

26:29.000 --> 26:33.000
So the number of CPUs defines how many collections you have.

26:33.000 --> 26:39.000
And this allows us, for example, to set up the collection table only once at boot and to never touch it again.
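The two-stage ITS lookup just described, DeviceID to device table to interrupt translation table, EventID indexing the ITT, yielding the final INTID plus a collection (one per CPU in this scheme), can be sketched as follows. The std::map-based structures are purely illustrative; the real tables are in-memory arrays walked by the ITS hardware.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// The ITT entry holds the final INTID and the target collection (~= CPU).
struct IttEntry { std::uint32_t intid; std::uint16_t collection; };
using Itt = std::map<std::uint32_t, IttEntry>;        // EventID -> entry

// Sketch of the GIC ITS translation flow: DeviceID selects a device-table
// entry pointing at an ITT; EventID indexes the ITT.
struct Its {
    std::map<std::uint32_t, Itt> device_table;        // DeviceID -> ITT

    std::optional<IttEntry> translate(std::uint32_t device_id,
                                      std::uint32_t event_id) const {
        auto dte = device_table.find(device_id);
        if (dte == device_table.end())
            return std::nullopt;                      // unknown device
        auto ite = dte->second.find(event_id);
        if (ite == dte->second.end())
            return std::nullopt;                      // unknown event
        return ite->second;                           // INTID + collection
    }
};
```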

26:39.000 --> 26:41.000
So now comes a really interesting part.

26:41.000 --> 26:44.000
That's when you want to program these base registers.

26:45.000 --> 26:52.000
Depending on how many bits of device ID you want to support, this device table can be one level or two level.

26:52.000 --> 27:00.000
And there's an indirect bit, which you can set in the base register, to say: I want a one-level table or a two-level table.

27:00.000 --> 27:01.000
Sounds very flexible.

27:01.000 --> 27:10.000
Until you read the fine print in the manual that says: this bit is read-as-zero, write-ignored, for GIC implementations that only support flat tables.

27:10.000 --> 27:16.000
So in the worst case, you decide in software that you want the two-level table, and the hardware says: nope,

27:16.000 --> 27:18.000
I only support flat tables.

27:18.000 --> 27:27.000
And then there are other bits which say what the shareability of this table is: can it actually participate in coherency or not?

27:27.000 --> 27:33.000
And you may say: I want the table to be shared with all the processors in the inner-shareable domain.

27:33.000 --> 27:38.000
And then the fine print in the manual says it is implementation defined whether this field has a fixed value.

27:39.000 --> 27:50.000
or can be programmed. So again, you have to deal with all the possible cases of: oh, it's actually not shareable, and you have to do manual coherency management.

27:50.000 --> 28:02.000
And the worst part of it all is: there's a page-size field, actually two page-size bits, where you can say a table in this hierarchy is 4K, 16K, or 64K in size.

28:03.000 --> 28:11.000
And again, there's some fine print that says the GIC implementation may only support one fixed page size, and the field might be read-only.

28:11.000 --> 28:25.000
So what software actually has to do is make a guess and say: I want a two-level table with 4K pages. You try to program that, then you read back the configuration, and the hardware says nope.

28:25.000 --> 28:34.000
Okay, then I'll use 16K tables. And then the hardware says nope again.

28:34.000 --> 28:45.000
And in the worst case you end up with oh you can only have 164k table and you can only have one level so a flat table.

28:45.000 --> 28:59.000
And then you may end up having to concatenate 256 64k pages in one flat table which means you have to allocate a giant 16 megabyte table and your memory allocator needs to support that.

28:59.000 --> 29:05.000
So we have this algorithm in over and it works but this amount of pain is interesting.

29:05.000 --> 29:30.000
So let's talk about the number API because when you want to configure an interrupt or a device you have to name it and we make interrupt and devices available as capabilities because capabilities provide for very nice authorization model where you can say if I have a capability I can name some resource and it also conveys my access to it.

29:30.000 --> 29:45.000
So multiple of these hypercalls need to name devices or interrupts, but as you've seen, the naming between x86 and ARM is not the same: on ARM an interrupt has an INTID; on x86 an interrupt has a global system interrupt (GSI) number.

29:45.000 --> 30:04.000
So the idea is that when we create a device or when we create an interrupt semaphore, we provide the name once, and this is the part of the API which is architecture-specific.

30:04.000 --> 30:25.000
So when you create an interrupt semaphore, you either provide the name in an ARM-specific manner or in an x86-specific manner, and then you get a capability to an interrupt semaphore which is uniform across all architectures. Likewise, when you create a device context, you provide the name for the device either in the x86 or in the ARM format, depending on which architecture you're on.

30:25.000 --> 30:35.000
And then you have a device capability which is again generic across architectures, so it keeps the rest of the hypercall API clean and you no longer have to worry about those differences.

30:35.000 --> 30:52.000
So device context is what we introduced for devices, to encapsulate all this architecture-specific state in a new object. You don't really create a device, because the device is already there, but you create a management context for the device, which we call a device context.

30:52.000 --> 31:11.000
And it encodes the topology, so the device ID or stream ID in the ARM case and the PCI topology in the x86 case, but it also encodes information about which SMMU or GIC ITS is managing the DMA and interrupt translation for the device.

31:11.000 --> 31:25.000
And then you get a capability which you can use with the assign_dev or assign_int call to associate that device with an address space or to route an interrupt from that device.

31:25.000 --> 31:40.000
So in the interest of time I'm going to skip this slide, which talks about the NOVA releases and which features appeared in which NOVA release. Suffice it to say that we do a new release every two months, so that's six releases per year.

31:40.000 --> 31:54.000
And all the things that are listed in bold on the slide we've briefly covered in this talk. There's one thing that I want to mention at the end, and that is: when we implemented all this stuff, we could get NOVA running quite nicely on

31:54.000 --> 31:59.000
Graviton server machines, which require all these features for scaling.

31:59.000 --> 32:07.000
But then we tried to put NOVA on a new machine that we've got, which was an NVIDIA DGX Spark, one of those NVIDIA AI machines.

32:07.000 --> 32:16.000
And unfortunately this machine doesn't have any UARTs, or no UART is connected, so you can't really capture the output.

32:16.000 --> 32:22.000
And what we had to do is we actually had to implement a framebuffer console in NOVA.

32:22.000 --> 32:30.000
This is something that I've tried to avoid for a long time, because I didn't want to do any font rendering and scrolling and blitting.

32:30.000 --> 32:34.000
But I had to do it to actually get some output on this box.

32:34.000 --> 32:44.000
And I had to implement a lot of performance tricks, but none of this code uses any architecture-specific assembler code; it's all portable C++.

32:45.000 --> 32:52.000
But it's very heavily optimized and we can actually get very high frame rates for scrolling.

32:52.000 --> 32:57.000
So on the Intel platform we can actually get 4,500 frames per second.

32:57.000 --> 33:02.000
And even on some wimpy Raspberry Pi we still get 245 frames per second.
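The core of a portable framebuffer console scroll can be sketched in a few lines. This is a minimal illustration in the spirit of "no assembler, just C++", not NOVA's heavily optimized implementation; the `Console` type and its layout are assumptions:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal sketch of portable framebuffer scrolling: shift all text rows
// up by one glyph height with a single large copy, then blank the last
// row. No architecture-specific code, just memmove/memset.
struct Console {
    unsigned width, height, glyph_h;      // all in pixels
    std::vector<uint32_t> fb;             // 32bpp linear framebuffer

    Console(unsigned w, unsigned h, unsigned gh)
        : width(w), height(h), glyph_h(gh), fb(size_t(w) * h, 0) {}

    void scroll() {
        size_t row_pixels = size_t(width) * glyph_h;
        // Move everything below the first text row up by one row.
        std::memmove(fb.data(), fb.data() + row_pixels,
                     (fb.size() - row_pixels) * sizeof(uint32_t));
        // Blank the freed bottom row.
        std::memset(fb.data() + fb.size() - row_pixels, 0,
                    row_pixels * sizeof(uint32_t));
    }
};
```

Scrolling by whole glyph rows keeps the copy as one contiguous `memmove`, which the compiler and libc can vectorize per platform without any hand-written assembler.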

33:02.000 --> 33:09.000
And you can actually see the boot output from just NOVA, with nothing running on top of it, on the DGX Spark.

33:10.000 --> 33:18.000
So if you're interested in what we needed to do to make this happen: it's just 350 lines of code, of which 250 lines are font data.

33:18.000 --> 33:24.000
So, what's next? NOVA has been open source for 15 years.

33:24.000 --> 33:29.000
And as I mentioned, we open-sourced the specs and proofs last year.

33:29.000 --> 33:37.000
Later this year we are going to make significant portions of the NOVA userland stack publicly available under an open-source license.

33:37.000 --> 33:45.000
And we are currently figuring out how we can reduce the barrier of entry such that it becomes easy for all of you to use this.

33:45.000 --> 33:49.000
So here's what we decided we will do, but it's open for discussion.

33:49.000 --> 33:51.000
So if you have ideas, talk to us.

33:51.000 --> 33:58.000
We want to make everything configurable with Kconfig, so that you have a global config for the entire software stack.

33:58.000 --> 34:04.000
When you do this and you type make you get one giant image which contains all the components.
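A stack-wide Kconfig for this could look something like the fragment below. This is purely a hypothetical sketch; the option names are illustrative, not the actual NOVA configuration:

```kconfig
# Hypothetical global config for the whole NOVA software stack;
# all option names here are made up for illustration.
mainmenu "NOVA Software Stack"

config BOARD
	string "Target board"
	default "qemu-virt"

config CONSOLE_FB
	bool "Framebuffer console"
	default y

config COMPONENT_VMM
	bool "Include the user-mode virtual machine monitor"
	default y
```

The point of a single Kconfig tree is that one `make` invocation can then produce the one bootable image containing all selected components.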

34:04.000 --> 34:12.000
And you can just boot this image from your favorite bootloader and there's three different ways of booting supported.

34:12.000 --> 34:24.000
You can either boot this as a Multiboot image, or you can boot it like a Linux kernel in the way you would boot a bzImage, or you can boot it as a UEFI executable from your UEFI shell.

34:24.000 --> 34:27.000
So the same image works in all these three different ways.

34:27.000 --> 34:30.000
So with that, my time is up.

34:30.000 --> 34:33.000
I don't know, maybe you have room for one or two questions.

34:33.000 --> 34:38.000
I have one question.

34:40.000 --> 34:41.000
I have a question.

34:41.000 --> 34:42.000
Thank you.

34:42.000 --> 34:45.000
For questions, regarding data.

34:45.000 --> 34:52.000
So, did you have the chance to look into user interrupts, as introduced on the latest Intel CPUs?

34:52.000 --> 34:58.000
I read about it in the manual, but it's becoming available only on the latest processor generations,

34:58.000 --> 35:04.000
and since we go all the way back to the first 64-bit capable processors,

35:04.000 --> 35:08.000
it's not clear if we want to support that.

35:08.000 --> 35:10.000
Yeah.

35:16.000 --> 35:20.000
The question was, what are the firmware requirements?

35:20.000 --> 35:23.000
On x86, it's all UEFI.

35:23.000 --> 35:25.000
On ARM, we actually prefer UEFI

35:25.000 --> 35:30.000
if we can, but we can also launch from U-Boot. On ARM,

35:30.000 --> 35:34.000
we ask the firmware not to close boot services.

35:34.000 --> 35:38.000
So, you boot into NOVA with UEFI boot services enabled,

35:38.000 --> 35:41.000
and then NOVA actually closes the boot services

35:41.000 --> 35:45.000
after obtaining a memory map and taking some information from UEFI.

35:45.000 --> 35:49.000
So, for example, we use the UEFI Graphics Output Protocol

35:49.000 --> 35:53.000
to determine whether we have a framebuffer and what the resolution is.

35:54.000 --> 36:00.000
Which means we actually switch to the highest possible resolution in UEFI.

36:00.000 --> 36:05.000
So, UEFI sets up the framebuffer and the resolution for us.

36:05.000 --> 36:09.000
And we do not have to worry about EDID, display resolution,

36:09.000 --> 36:10.000
mode switching.

36:10.000 --> 36:14.000
We effectively get a linear framebuffer, and we can just write to it.
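The handful of facts worth keeping from the Graphics Output Protocol before boot services go away can be modeled like this. The types below are illustrative stand-ins, not the real `EFI_GRAPHICS_OUTPUT_PROTOCOL` structures, and the mode-selection helper mirrors "switch to the highest possible resolution" by picking the mode with the most pixels:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model of the data a kernel keeps from UEFI GOP before
// exiting boot services: framebuffer base, pitch and resolution.
struct VideoMode { uint32_t width, height; };

struct FramebufferInfo {
    uint64_t  base;    // physical address of the linear framebuffer
    uint32_t  pitch;   // pixels per scan line
    VideoMode mode;
};

// Pick the mode with the highest pixel count from the modes the
// firmware enumerates, so the console comes up at maximum resolution.
size_t pick_best_mode(const std::vector<VideoMode>& modes) {
    size_t best = 0;
    for (size_t i = 1; i < modes.size(); i++)
        if (uint64_t(modes[i].width) * modes[i].height >
            uint64_t(modes[best].width) * modes[best].height)
            best = i;
    return best;
}
```

Because the firmware has already programmed the display hardware, the kernel never touches EDID or mode-setting; it only records where the linear framebuffer lives and how wide each scan line is.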

36:14.000 --> 36:19.000
We still have to do all the manual scrolling and glyph rendering and whatnot.

36:19.000 --> 36:21.000
Thank you.

36:21.000 --> 36:23.000
Thank you.

