WEBVTT

00:00.000 --> 00:10.140
Hello and welcome. My name is Andrin. I'm a PhD student in the Secure and Trustworthy

00:10.140 --> 00:17.500
Systems Group at ETH Zurich. And today I'd like to introduce you to OpenCCA, a framework

00:17.500 --> 00:23.380
or a tool that we've been building to help you do research on Arm CCA on existing hardware

00:23.380 --> 00:33.100
before CCA is actually available. So quick spoiler, I brought the box, this thing here.

00:33.100 --> 00:37.940
And later in this talk we'll attempt a live demo and put a confidential virtual machine

00:37.940 --> 00:46.260
on OpenCCA and run some GPU workloads on it. But more on this later. So the main challenge

00:46.260 --> 00:51.960
today with Arm CCA is really that for most of us there is no CCA hardware available yet that

00:51.960 --> 00:58.880
you can just buy and tinker with. The first rollouts, if you check online, they're likely

00:58.880 --> 01:05.080
data center first. So if you check, you'll see that a bunch of companies, including Fujitsu,

01:05.080 --> 01:12.240
Microsoft and also likely NVIDIA, have all announced that they will provide CCA-capable CPUs

01:12.240 --> 01:20.120
for the cloud. But even then it's unclear how open, affordable and hackable those platforms

01:20.120 --> 01:28.160
will be. And so if you want to do research on CCA, you typically have two choices today.

01:28.160 --> 01:35.040
One choice is software simulation. So think of things like the Arm FVP or QEMU. These systems

01:35.040 --> 01:40.080
simulate the entire CCA hardware stack in software. And this is great

01:40.080 --> 01:45.600
to validate the correctness of your new research design and also great to validate the compatibility

01:45.600 --> 01:52.240
of how your code will run on the next generation of hardware. But what software simulation

01:52.240 --> 01:58.080
is not good at is that it does not tell you anything about how fast your code actually

01:58.080 --> 02:04.880
runs. All you know is that you run your instructions correctly, but you lack any further insights

02:04.880 --> 02:12.520
into microarchitectural effects like cycles. And so since people also care about cycles, performance

02:12.520 --> 02:18.480
and overheads, what they typically do, and this is also in the context of research, is that

02:18.480 --> 02:23.160
they take their design that they now have running in software simulation and they transplant it

02:23.160 --> 02:32.080
to Armv8 boards. So the current iteration of Arm hardware. And the insight here is,

02:32.080 --> 02:39.560
in many ways, CCA hardware will run similarly to how existing hardware works. So this

02:39.600 --> 02:49.600
can be used to estimate the performance overheads. But this comes with its own set of challenges

02:49.600 --> 02:54.120
because typically these performance prototypes that people are building, they are not open

02:54.120 --> 03:02.560
sourced, making it difficult for others to reuse and reproduce their work, and in general

03:02.560 --> 03:08.160
this wastes a lot of engineering, since everyone sort of builds their own thing. And

03:08.160 --> 03:15.440
yeah, does not open source it. And so with OpenCCA, we tried to solve some of these

03:15.440 --> 03:20.640
pain points of these performance prototypes that people are building, by providing some sort

03:20.640 --> 03:28.520
of open baseline to measure and run research designs on OpenCCA, on existing hardware.

03:28.520 --> 03:39.000
And so at a high level we are trying to accomplish the following goals. So first, we want

03:39.000 --> 03:45.400
to keep the changes to the Arm CCA reference stack as minimal as possible, while also ensuring

03:45.400 --> 03:50.600
correct functionality on Armv8 hardware. So this means we want to run confidential

03:50.600 --> 03:58.440
virtual machines on existing hardware. Second, and this is very important, we cannot

03:58.440 --> 04:05.240
give the same security claims as Arm CCA. This runs on non-CCA hardware and so

04:05.240 --> 04:14.840
this is only for benchmarking and tinkering with real devices. Third, we tried to target affordable

04:14.840 --> 04:22.920
and open Armv8 boards. So the barrier to entry is low and everyone can try to replicate

04:23.240 --> 04:29.240
our setup. And fourth, we tried to focus on the reusability aspect of the framework

04:30.120 --> 04:35.800
in the sense that we tried not to be board specific as far as this is possible, so that our

04:35.800 --> 04:44.360
work can also be ported to different boards. So now that we've seen some high-level goals, let's

04:44.360 --> 04:50.280
take a look at how we're actually building this. And for this, let's recap a bit of CCA background first.

04:51.080 --> 04:56.920
So you probably know that before the introduction of Armv9 we had TrustZone, which divided

04:56.920 --> 05:04.120
compute into the normal world and the secure world. And now with the introduction of Arm CCA,

05:04.120 --> 05:09.960
or in particular a hardware feature called the Realm Management Extension, the architecture

05:09.960 --> 05:17.000
introduces two more worlds. So we have the realm world for CCA's version of confidential virtual

05:17.960 --> 05:23.080
machines and we also have the root world for the most privileged firmware code.

05:24.360 --> 05:28.200
Now what's great about the Arm architecture is really that it's very explicit.

05:29.000 --> 05:35.400
So we can write firmware in code and most things are not hidden in closed microcode.

05:37.960 --> 05:44.120
So in OpenCCA we tried to reuse the reference stack as much as we can while also keeping the changes

05:44.200 --> 05:52.280
small. But how do we actually do this? Since we only have the normal world and the secure world

05:52.280 --> 05:58.520
on Armv8 hardware, we emulate the realm world within the architectural normal world

05:59.480 --> 06:04.520
and the root world within the architectural secure world.

06:05.800 --> 06:11.560
And so at first glance this might seem straightforward but things can get messy quite quickly

06:11.560 --> 06:17.000
because CCA firmware expects certain hardware features to be available, while these are clearly

06:17.000 --> 06:25.560
not present on Armv8 hardware. And so this essentially boils down to how do we emulate enough

06:25.560 --> 06:31.000
of these missing hardware features in software to make the firmware believe it's actually running

06:31.000 --> 06:38.760
in a real CCA environment while also keeping the changes small. And this essentially means, in code,

06:38.760 --> 06:46.680
that we guard missing hardware features and either re-implement them

06:46.680 --> 06:51.560
in software if they're strictly needed to boot confidential virtual machines, or we force

06:51.560 --> 06:58.840
disable them if they're not strictly needed. And so in our paper and also in code we go

06:58.840 --> 07:04.360
into much more detail on how we actually do this, but let's take a look at one of these missing hardware

07:04.440 --> 07:11.560
features that are not available on this board. And so this is a hardware feature called TTST,

07:11.560 --> 07:19.480
which stands for small translation tables. It's an optimization for the MMU. If your address space is

07:19.480 --> 07:25.880
small with this hardware feature the MMU does not have to do as many page table walks

07:27.160 --> 07:33.720
and so the page walk is faster. And for this let's take a quick look at the memory layout

07:33.720 --> 07:41.320
of the trusted hypervisor, that is the RMM. So on Arm we have two translation table base

07:41.320 --> 07:52.920
registers. We have TTBR0 and TTBR1. This is similar to CR3 on x86, and the memory layout of the RMM

07:52.920 --> 08:01.560
decides to use TTBR0 for things that are mostly identity mapped and shared across cores.

08:04.280 --> 08:13.640
And TTBR1 for things that are not identity mapped and per CPU. Thank you.

08:14.440 --> 08:23.320
And so the insight here is that the RMM only touches a few megabytes in size for TTBR1.
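
NOTE
A minimal sketch of why a few megabytes behind TTBR1 calls for small
translation tables. It assumes the Arm rule that a TTBRn region spans
2^(64 - TnSZ) bytes and that, for the 4KB granule, TnSZ tops out at 39
without FEAT_TTST and at 48 with it. The names below are illustrative
and are not taken from the RMM sources; the lower TnSZ bound is not modeled.
```python
MAX_TXSZ_BASE = 39  # largest TnSZ without FEAT_TTST (4KB granule)
MAX_TXSZ_TTST = 48  # largest TnSZ with FEAT_TTST (4KB granule)
def region_bytes(txsz: int) -> int:
    """Size of the virtual address region selected by a TnSZ value."""
    return 1 << (64 - txsz)
def txsz_for(size_bytes: int, has_ttst: bool) -> int:
    """Largest legal TnSZ whose region still covers size_bytes."""
    limit = MAX_TXSZ_TTST if has_ttst else MAX_TXSZ_BASE
    wanted = 64 - max(size_bytes - 1, 0).bit_length()
    return min(wanted, limit)
if __name__ == "__main__":
    four_mib = 4 << 20
    for has_ttst in (True, False):
        t1sz = txsz_for(four_mib, has_ttst)
        print(has_ttst, t1sz, region_bytes(t1sz) >> 20, "MiB")
```
Under these assumptions a 4 MiB region needs T1SZ = 42, which only
hardware with TTST accepts; without it the region cannot shrink below 32 MiB.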

08:23.880 --> 08:31.800
So they use this hardware feature, TTST. And so the challenge here was that this feature

08:31.800 --> 08:40.680
is not available on 8.2, the version that we use. And so the challenge was how to find this,

08:41.640 --> 08:48.040
since it just crashed, and it only showed in the way the page table structure was filled.

08:50.040 --> 08:56.600
And so the workaround here is that we exploit what Armv8.2 has, which means we increase

08:56.600 --> 09:04.680
TTBR1 to span a larger virtual memory size, and then we can use what the hardware has to

09:04.680 --> 09:15.240
actually make the memory work. So in 2025 we looked into around 40 different boards for this project

09:15.240 --> 09:25.320
and we picked the RK3588 by Rockchip, in particular the Rock 5B model. It's great

09:25.320 --> 09:31.880
for several reasons. It has open EL3 so we can flash firmware code. It has good documentation and it's also

09:31.880 --> 09:39.320
somewhat affordable, with the 16 gigabyte version starting around 250 US dollars, but I think the 4

09:39.320 --> 09:45.640
gigabyte RAM version starts at around 100. And so as I said it's based on the 8.2 architecture, has

09:46.280 --> 09:54.040
Cortex-A76 and A55 cores, and yeah, this is also what I brought with me today.

09:55.000 --> 10:06.360
So currently we are able to boot confidential virtual machines on a stack that's maybe a year

10:06.360 --> 10:14.680
old, so it's based on TF-A version 2.11 and RMM version 0.5. We have someone that looks into

10:14.680 --> 10:22.120
pulling in the latest changes, I think currently it's 0.8 for the RMM, and in total we touched around

10:22.200 --> 10:29.400
2,500 lines of code, and only in TF-A and the RMM, so we don't need to change the

10:29.400 --> 10:37.240
guest or the host or the VMM. And so in these 2,500 lines of code, this is mostly, or to a

10:37.240 --> 10:44.840
large percentage, board definitions in the RMM, and also I think we ported the console drivers,

10:44.840 --> 10:54.680
so the effective change is actually much smaller than this. Now the OpenCCA project went

10:54.680 --> 10:59.720
through several iterations. We're now at the point where we have these stacked boxes in our

10:59.720 --> 11:07.880
lab that include a Raspberry Pi to streamline the firmware flashing and power management of the board,

11:08.200 --> 11:17.800
and that just makes working with the platform easier. Okay, so this brings me to

11:17.800 --> 11:24.200
the live demo, and the way I want to structure this is in two parts. First I will show you

11:24.200 --> 11:31.080
what the demo will show and then I will tell you what I had to change on top of OpenCCA to make

11:31.240 --> 11:36.760
the demo work. So in the demo we'll boot a confidential virtual machine on OpenCCA,

11:36.760 --> 11:45.880
I think with one vCPU and 512 megabytes of RAM, and then we'll attach the Mali-G610, so there is

11:45.880 --> 11:54.680
an integrated GPU on the board. We'll attach that and then in the CVM we'll start X and run some

11:54.680 --> 12:03.880
OpenGL benchmark on the GPU. And so the purpose of this demo is really to show how easy it is to

12:03.880 --> 12:13.880
prototype systems research ideas on OpenCCA. So disclaimer: this is not trusted I/O. On purpose we

12:14.680 --> 12:21.720
leave the GPU MMIO hypervisor shared, so this is mostly so that I don't have to change the RMM.

12:21.720 --> 12:31.080
This uses vanilla RMM APIs, [unclear] goes to the hypervisor, and yeah, so I had to prototype

12:31.080 --> 12:37.720
a VFIO-inspired interrupt routing and then also create a stage 2 mapping for the GPU, so

12:37.720 --> 12:48.120
the driver can actually talk to the GPU in the CVM. Yeah, so there's a QR code for all the demo code

12:48.120 --> 12:57.720
that I wrote on top of OpenCCA, here in this QR code. All right, so now let's see if this works.

13:08.440 --> 13:25.640
So first I will mirror my screen. Okay, so what we're seeing here is I now have UART output and

13:25.640 --> 13:32.920
I'm in the normal world in the untrusted hypervisor, so this is KVM Linux. And so as a first step

13:33.720 --> 13:43.560
I will detach the GPU, and then I will now use kvmtool to boot a realm VM

13:47.000 --> 13:53.720
with these changes that I introduced before. So we create a stage 2 mapping of the GPU MMIO and we have

13:53.720 --> 14:02.840
an interrupt router that now forwards the physical interrupts of the GPU into the CVM if they

14:02.840 --> 14:15.160
arrive. We also changed kvmtool to create a device tree entry for the GPU for the CVM. Okay, so

14:15.160 --> 14:21.560
we are now in the CVM and we see that it did a bunch of Realm Management Interface and

14:21.560 --> 14:30.120
Realm Services Interface calls. Okay, let's go on. And so as the next step, now the GPU.

14:30.920 --> 14:39.240
And so in the CVM I will attach the Panfrost driver and use the TigerVNC server to spawn X,

14:40.600 --> 14:45.480
so you can see here, since we created the mapping, Panfrost, so that's the GPU driver, can actually

14:45.480 --> 14:56.920
talk to the GPU, and the TigerVNC server now exposes VNC over TCP/IP. So now on my laptop

14:58.120 --> 15:11.320
I can now connect to the CVM over VNC. So now we are in X in the realm VM, and I start some

15:11.320 --> 15:20.760
OpenGL demo for testing. If we now take a look at /proc/interrupts, so this does

15:21.480 --> 15:30.920
cat on /proc/interrupts, we see that the Panfrost driver receives interrupts to submit or complete

15:30.920 --> 15:36.680
GPU rendering jobs. And so if I actually now hide this window, we see that

15:36.840 --> 15:44.040
it no longer receives interrupts, since the GPU is no longer used for this OpenGL rendering.
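
NOTE
A small sketch of the check shown in the demo: take two snapshots of
/proc/interrupts and compare the counters of the Panfrost lines. The two
snapshots below are made up for illustration (IRQ numbers, counts and the
single CPU0 column are assumptions); on the board you would read the real
file twice instead. The function names are illustrative as well.
```python
from itertools import takewhile
# Hypothetical snapshots in /proc/interrupts layout for a one-vCPU guest.
BEFORE = """\
           CPU0
 35:       1204     GICv3 124 Level     panfrost-job
 36:          3     GICv3 125 Level     panfrost-mmu
"""
AFTER = """\
           CPU0
 35:       1290     GICv3 124 Level     panfrost-job
 36:          3     GICv3 125 Level     panfrost-mmu
"""
def irq_counts(text, needle="panfrost"):
    """Sum the per-CPU counters of every line whose name contains needle."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or needle not in fields[-1]:
            continue
        # Counters are the run of integers right after the "NN:" column.
        per_cpu = takewhile(str.isdigit, fields[1:])
        counts[fields[-1]] = sum(int(c) for c in per_cpu)
    return counts
def irq_delta(before, after):
    """Interrupts that arrived between two snapshots, per IRQ name."""
    old, new = irq_counts(before), irq_counts(after)
    return {name: new[name] - old.get(name, 0) for name in new}
if __name__ == "__main__":
    print(irq_delta(BEFORE, AFTER))
```
A delta that stays at zero for every Panfrost line between two reads would
match what the talk describes once the OpenGL window is hidden.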

15:45.720 --> 15:52.760
And so now we are in the CVM, so this runs a 6.12 kernel with I think the CCA

15:52.760 --> 16:01.880
guest enlightenment version 7, and in our dmesg we'll see that the RMM exposes

16:01.880 --> 16:05.880
Realm Services Interface version 1.0 to the kernel.

16:10.120 --> 16:13.000
All right.

16:13.000 --> 16:42.000
And so this concludes my talk. So if you work with Arm CCA and are currently constrained by some of the limitations that

16:42.000 --> 16:50.000
software simulation gives you, or want to tinker with real devices in the context of Arm CCA,

16:50.000 --> 16:58.000
please check out OpenCCA. We have documentation on how to build one of these boxes online,

16:58.000 --> 17:05.000
and our forks of the upstream repos are on GitHub. We also have an academic publication that goes into more detail on

17:05.000 --> 17:13.000
how we built this, and if you're excited about this, please reach out. So thank you very much.

