WEBVTT

00:00.000 --> 00:12.720
All right. Welcome everyone. My name is Torsten, and this is Isha. Both of us will be presenting

00:12.720 --> 00:19.760
the work CalyxOS has been doing with HSM signing, but I want to reiterate this was a team effort

00:19.760 --> 00:25.360
from various other people involved in CalyxOS, like Mike over there in the corner, and also a

00:25.360 --> 00:30.320
big shoutout to Tommy, who really did the bulk of the work that we're going to present. He

00:30.320 --> 00:37.520
can't be here today, but thanks to Tommy. And let's get started with the lessons that we learned

00:37.520 --> 00:45.760
from using HSMs for signing. So, to start with the basics: why use HSMs at all, and what

00:45.760 --> 00:53.360
is an HSM? So, code signing is used in a lot of places, like even Microsoft uses it, Apple uses

00:53.360 --> 00:58.880
it. You may know it from Google Play, where apps need to be signed, like Izzy mentioned in the talk

00:58.880 --> 01:04.320
before. So, it verifies the origin of the software, verifies it comes really from you, and

01:04.320 --> 01:09.920
not from anybody else, because you hold the private key. And this is where the problem starts,

01:09.920 --> 01:18.160
like you need to keep it secure. And if it's a file, it has the bad property that files can be

01:18.160 --> 01:24.720
copied quite easily. And you don't know if it has been copied. So, that's a problem, because

01:25.520 --> 01:34.960
you may have lost your private key, and you don't even know it. Anybody who has a copy of your

01:34.960 --> 01:44.080
private key can make valid signatures forever, and you have no way to stop them. So, even if

01:44.160 --> 01:48.880
you discover it, you know, oh, this guy is making signatures, what are you going to do? The only option

01:48.880 --> 01:55.520
you have is to start again. Make a new version of your software under a new name, maybe;

01:55.520 --> 02:00.080
with Android you use a new package ID. Like Izzy just mentioned, people lose their private

02:00.080 --> 02:05.280
keys even, right? So, they have to start over. And, yeah, revoking the key in Android,

02:05.280 --> 02:15.360
I think, is not a thing; I'm not aware that you can do it. So, this is where the HSM comes in.

02:16.320 --> 02:24.240
So, it stores the key in hardware, and there you can't just copy it. It's in a little device,

02:24.240 --> 02:30.000
or in a bigger device, on a special secure cryptographic chip, and you're not supposed to

02:30.000 --> 02:35.680
get it out in any way possible. There's a caveat though, that if there is, like, a big, powerful

02:35.680 --> 02:40.960
attacker, like, nation-state-like, and they have physical access to the HSM, they may be able to use

02:40.960 --> 02:46.320
electron microscopes and acid, scrape off the layers of the chip, and somehow extract

02:46.320 --> 02:56.960
key material, but this is maybe not our threat model. So, one use case that is maybe interesting

02:56.960 --> 03:03.280
for many people here is that you have automatic build signing set up in your CI,

03:03.280 --> 03:09.520
like you take a release, CI builds it, and it even signs it automatically. So, with the private key

03:09.520 --> 03:14.720
files, it's kind of tricky, because if somebody hacks your CI server, it's game over: they have

03:14.720 --> 03:21.440
the private key and it's gone. But if they hack your CI with an HSM, they may be able to do some

03:21.520 --> 03:27.680
malicious signing, because they have access to the HSM, but you can stop them. You can pull the HSM

03:27.680 --> 03:33.440
out, you can re-secure the system, set it up again, throw out the attackers, and they lose the

03:33.440 --> 03:42.000
ability to sign. So, that's the key difference. Yeah, that's what I just said, and now

03:42.720 --> 03:55.440
I'm handing this whole thing here over to Isha. Testing, testing, okay, awesome. So, coming into

03:55.440 --> 04:00.240
this, we now know we want to use an HSM; rough consensus in this room is that an HSM is better than

04:00.240 --> 04:06.160
having a private key file just floating around in the ether. And in order to make the decision

04:06.160 --> 04:10.640
about what HSM to use, since we have a lot of different options, we wanted to set a number of criteria for

04:10.640 --> 04:15.840
us to decide this. So, the first is we want to be guided by CalyxOS values and principles.

04:16.400 --> 04:22.960
The primary ones among those, you know, are privacy, security, and accessibility. These are the

04:22.960 --> 04:27.200
same values that drive the development of CalyxOS and the different features that we have.

04:28.160 --> 04:33.120
Second to that is free and open source: any work that we did, we wanted to make sure that we could

04:33.120 --> 04:38.240
share it with the developer community, but not only the developer community, we also wanted to share it

04:38.320 --> 04:47.040
with our community who's using the phone, and so being able to have this level of transparency where

04:47.040 --> 04:52.960
we share both the process but also the code is incredibly important to us. The second set of criteria

04:52.960 --> 04:59.360
was security, and so it was incredibly important, of course, it's incredibly important to make

04:59.360 --> 05:06.000
sure that both the organization, the team, and then the members on our project were able to have

05:06.000 --> 05:12.480
ownership over this. So, who has which part, who is in charge of admin, administration, who is

05:12.480 --> 05:17.280
in charge of signing, who is in charge of auditing. So, we wanted to ensure that we have distinct

05:17.280 --> 05:24.640
role-based access and follow the principle of least privilege. This will help protect against any

05:24.640 --> 05:29.920
unauthorized access and exposure. But of course we can't just say that; we have to see that. So,

05:29.920 --> 05:36.000
for that, we need logs. So, who does which operation? What are they signing? What are they doing?

05:36.000 --> 05:41.840
And when are they doing it? The third set of criteria was more operational, related to resilience

05:41.840 --> 05:48.720
and scale. So, in this case, we first want to make sure that we are redundant: if the team changes,

05:48.720 --> 05:54.320
something happens to someone, God forbid, or someone spills water on the HSM, we are able to continue

05:54.320 --> 06:01.920
signing, operating, and releasing CalyxOS releases. The second point here is that we also don't have

06:01.920 --> 06:08.560
infinite time to design and create this provisioning and signing ceremony. So, we wanted something that was

06:08.560 --> 06:14.320
tiered, that was expandable, and that we could keep improving over time, so that we can eventually,

06:14.320 --> 06:22.160
if needed, as capacity, resources, and time allow, lift and shift it over. So, with all these

06:22.160 --> 06:31.520
in mind, we ended up with the YubiHSM 2. The biggest and most important reason that we ended up here

06:31.520 --> 06:37.440
is availability, just getting the physical hardware to different folks in different places on our

06:37.440 --> 06:43.520
team (our team is in many different regions all across the world) was not the easiest, and so being

06:43.520 --> 06:49.360
able to secure that was really important. The second reason is that it has open source

06:49.360 --> 06:54.320
tooling that we were able to leverage and build on top of. And finally, affordability. If you search

06:54.320 --> 07:00.800
up the prices of available HSMs right now, you can go from thousands to tens of thousands,

07:00.800 --> 07:07.680
so at a unit price of like 650 bucks, this was a lot more affordable. Of course, because of that

07:07.680 --> 07:12.800
also, there's a lot of room for improvement, and hopefully future plans. Now I'll pass it on.

07:12.880 --> 07:21.760
So, we're seeing this as an intermediate solution, really. Like it's a cheap hardware

07:21.760 --> 07:27.840
dongle, relatively cheap compared to the professional ones, and we are hoping, eventually we

07:27.840 --> 07:34.320
can improve this, but this is what we ran with also to collect experiences. So, let's talk

07:34.320 --> 07:38.400
already about some of the limitations of this little device that we're using, so we don't get

07:38.400 --> 07:42.800
money for advertising it, and we don't really want to advertise it. The device actually has

07:42.800 --> 07:48.400
some issues, and one is related to key management. So, the storage capacity is quite small.

07:49.200 --> 07:55.840
If you have just a couple of apps you want to sign, that is fine. But in our case, we are having

07:55.840 --> 08:01.440
more than 20 different Android devices, and each of them should have its own key, and then there's

08:01.520 --> 08:08.560
also different apps inside these devices, and there are various other partitions and keys

08:08.560 --> 08:13.760
that all need to be different. So, we need roughly at the moment, if we trim it down as much as

08:13.760 --> 08:22.960
possible, around 50 keys, and that's too many; it doesn't fit. So, there is a solution for that,

08:22.960 --> 08:30.240
and it's called key wrapping, and key wrapping means that you store your keys outside of the

08:30.240 --> 08:38.320
HSM, but not just like that: you encrypt them first, right? So, they get imported when you need them,

08:38.320 --> 08:44.720
and decrypted only inside the HSM. So, they never leave the HSM in plain text, and there is a

08:44.800 --> 08:51.920
wrap key that does all that work, and that one lives inside the HSM. Does it make sense?

08:54.320 --> 09:05.600
So, yes. So, now we have this wrap key, but how do we back things up now? Because the

09:05.600 --> 09:10.560
HSM, like Isha just said, may break, may get stolen; there's all sorts of things that can go

09:11.200 --> 09:15.120
wrong. So, we need to have backups, in case something like that happens, we can still continue to

09:15.120 --> 09:20.560
sign, and we don't have to ask users to reinstall their devices because we need new keys.

09:23.040 --> 09:27.680
So, the signing keys are easy, because they are encrypted. You can back them up, like you back

09:27.680 --> 09:33.680
up anything else, like encrypted cloud backups or whatever. Because nobody has this wrap key,

09:33.680 --> 09:38.400
they can't use the encrypted signing keys for anything. But what about the wrap key itself?

09:38.880 --> 09:45.600
In the end, that is, again, some sort of key you need to keep private. If you keep it in the

09:45.600 --> 09:56.320
file, you're back to square one, and you have all the problems all over again. So, what we found as the best

09:56.320 --> 10:02.320
solution is what's called Shamir's secret sharing, which is actually quite an old technique.

10:02.320 --> 10:09.920
There was a paper published in 1979, and this is still used today, and this is quite good.

10:10.560 --> 10:16.880
So, the idea here is that you have this wrap key, and you split it up in different parts,

10:16.880 --> 10:22.160
which are called shards. And if you ever need to get back to this wrap key,

10:22.160 --> 10:28.000
like you would restore it from your backup, then you need these shards to come together, and it can be a

10:28.000 --> 10:31.760
certain number of shards. Like if you have 10 people in your team, everybody has a shard,

10:31.760 --> 10:35.680
you can say, oh, I want at least eight of those people to come together, or seven. That's

10:35.680 --> 10:43.280
up to you to decide. So, that's quite powerful. But unfortunately, the YubiHSM doesn't

10:43.280 --> 10:48.960
support that natively. So, we had to do a complicated key provisioning ceremony.

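NOTE
Editor's sketch, not from the talk: a minimal illustration of the Shamir
secret sharing idea described above, assuming the wrap key can be treated as
a 256-bit integer. The names PRIME, split_secret and recover_secret are
hypothetical, and this is not the code used in the ceremony.
```python
import secrets
# Illustrative prime field: 2**521 - 1 is a Mersenne prime, larger than any 256-bit secret.
PRIME = 2**521 - 1
def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result
    return [(x, evaluate(x)) for x in range(1, shares + 1)]
def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
# Ten team members, any eight of whom can restore the (stand-in) wrap key.
wrap_key = secrets.randbits(256)
shards = split_secret(wrap_key, threshold=8, shares=10)
assert recover_secret(shards[:8]) == wrap_key
assert recover_secret(shards[2:]) == wrap_key
```
With fewer shards than the threshold, interpolation yields an unrelated value,
which is the property that lets each team member hold a shard safely.
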
10:48.960 --> 11:00.880
All right. So, the complicated key provisioning ceremony, exciting. So, this is kind of where it goes

11:00.880 --> 11:07.040
from theory, a lot of research, a lot of code written to actually conducting and provisioning

11:07.040 --> 11:12.480
the keys, generating the keys. I'll make one quick note here: we have three main

11:12.480 --> 11:19.200
ceremony goals. The first one is to generate the keys on two HSMs. The reason we chose to generate

11:19.200 --> 11:25.440
keys in this scenario is that it gives us a blank slate to work with. Existing old keys that

11:25.440 --> 11:31.280
maybe exist in backups, or that other people had access to, are not going to be factored in,

11:31.280 --> 11:37.520
and it gives us a reduced attack surface. So, if we are redesigning and we're doing this, we want to do this,

11:38.240 --> 11:44.240
so the goals of our provisioning ceremony: one is to generate the keys on two HSMs. Again,

11:44.240 --> 11:49.680
redundancy. Two, it's to be able to export the wrap key, sharded, and the authentication keys for

11:49.680 --> 11:56.480
different roles. And then three, it's to establish the audit logging mechanism for all HSM operations,

11:56.480 --> 12:04.720
starting with the provisioning ceremony. Now, the ceremony operation logistics were complicated,

12:04.800 --> 12:12.080
took a lot of work, tried to cover it in a slide really quickly, but we started by trying to

12:12.880 --> 12:20.720
ensure that we have a secure protected environment. And this, of course, starts physically, so

12:20.720 --> 12:25.200
having somewhere that we could not disclose to our community, or give updates on when we were doing

12:25.200 --> 12:30.640
it, and where the place was, of course, physically secured. The second part is digitally, so where

12:30.640 --> 12:34.320
we're actually running the scripts to provision the HSMs, what we're plugging them into.

12:35.840 --> 12:40.880
There we decided to go with Tails OS, an ephemeral operating system that's audited

12:41.600 --> 12:50.320
and well known in the community. In terms of having the accessibility to our own team members,

12:50.320 --> 12:57.040
making sure everyone can attend, we tried to bring the majority of our team in person for this ceremony,

12:57.120 --> 13:03.680
although we did have an option for remote participation. And then finally, the last main piece here

13:03.680 --> 13:09.600
is having the hardware. So, for the computer we're running this on, we maybe had a little

13:09.600 --> 13:13.760
purchase paranoia, but we went to a random store that we hadn't decided on ahead of time.

13:13.760 --> 13:18.240
Saw which computers were available, picked up the computer; same thing for any media we used,

13:18.240 --> 13:25.120
the live boot USB, for example, or anything that we used to save backups on, and then making sure

13:25.120 --> 13:30.880
that we reset the YubiHSM 2 both physically and also via software. And of course,

13:30.880 --> 13:37.360
no network connectivity throughout the entire ceremony. All of this was a learning experience for

13:37.360 --> 13:43.920
our team, but something that really helped was working hand-in-hand with Trail of Bits and performing

13:43.920 --> 13:49.440
a software audit at the end of this that we're looking forward to publishing soon, to make sure

13:49.520 --> 13:53.760
that everything we're running here is following the principles that we want to follow.

13:56.640 --> 14:01.840
So maybe to mention in relation to the previous talk that everything we did there is reproducible.

14:02.480 --> 14:07.120
So the software is all public and you can reproduce the ISO image that we burnt to a DVD

14:07.920 --> 14:18.080
exactly and validate its hash as well. So let me talk to you about PKCS11.

14:19.440 --> 14:24.000
Which is quite the tongue twister, and that's why I guess somebody invented the name Cryptoki.

14:24.000 --> 14:27.280
So you might have also heard that, but everybody just says PKCS11.

14:28.800 --> 14:35.840
Once you can remember that. And it's a standard C interface to have software communicate with your

14:35.840 --> 14:41.440
HSM, because this is at the end of the day what you want. You want your signing solution, your signing

14:41.440 --> 14:47.440
software, to not take a key from the hard drive; you want a key from the HSM to be used.

14:47.440 --> 14:56.480
And let the HSM do the signing. So all tools that are involved in signing Android builds

14:56.480 --> 15:04.320
already support it. So that's great. So we just plug it in and we're done. But unfortunately that

15:04.320 --> 15:12.480
wasn't the case. It was quite complicated actually because we can't just do it out of the box

15:12.480 --> 15:20.160
as it turned out there were lots of problems along the way. And we have quite the complex

15:20.160 --> 15:25.680
signing process. Google has huge scripts that handle this and you have to imagine you have

15:29.200 --> 15:33.680
the build has lots of different partitions and then each of these partitions is signed.

15:33.680 --> 15:38.000
But before you sign these partitions, you have lots of different apps and APEXes, which were also

15:38.160 --> 15:42.560
talked about earlier, that also have to get signed before. So you have to unpack the

15:42.560 --> 15:46.560
partitions. You have to sign the stuff inside. You have to repack. You have to sign the partition

15:46.560 --> 15:51.200
itself and you have to sign everything. And so it's quite complex and there's different tools

15:51.200 --> 15:59.280
involved. The main three ones are apksigner, signapk, and even OpenSSL. And all of those have to

15:59.280 --> 16:06.640
interface properly with your HSM. So how did we hook it all up? I'll just give you some examples

16:06.640 --> 16:13.200
because we don't have enough time to go through everything. But every tool had different

16:13.200 --> 16:19.280
ways and every tool had different bugs. So one easy example to understand is that we found

16:19.280 --> 16:27.760
apksigner for some reason wasn't closing the session it opened with the HSM. And that caused

16:27.760 --> 16:33.840
the build to have like some sort of a denial of service because suddenly the HSM tooling said,

16:34.240 --> 16:38.400
I'm out of sessions. I can't do anything anymore. But our signing scripts were still running

16:38.400 --> 16:46.480
trying to sign stuff. So we tried several things and in the end we patched apksigner to

16:46.480 --> 16:54.400
just close the session properly. Then there is this signapk tool, where to be honest we haven't

16:54.400 --> 16:59.200
understood why Google is using two different tools for the same job. And we found that the

16:59.200 --> 17:04.720
signapk tool is also incredibly buggy and we couldn't even get it to work properly with PKCS11. So

17:04.720 --> 17:13.760
what we did instead, we also patched apksigner and reverse engineered some stuff from signapk to allow

17:13.760 --> 17:20.800
us to just use apksigner instead, because that we had already got working. And we needed to fix some

17:20.800 --> 17:26.720
padding stuff. So the result is actually identical to what signapk would be doing, so the

17:26.720 --> 17:33.200
builds would actually be booting on a real phone. So, some points about performance:

17:33.200 --> 17:38.720
we also needed to tweak some stuff here because apksigner was being invoked a lot during the

17:38.720 --> 17:44.960
signing process. And when it gets invoked, it loads the key store, which is fine if the key

17:44.960 --> 17:51.520
store is in a file, but if you have an HSM, it's kind of slow. It takes a couple of seconds to get

17:51.520 --> 17:57.600
all the keys, since we also fill up the entire HSM with lots of keys. It's taking much too long.

17:58.400 --> 18:03.600
So we introduced some sort of a batch mode where it would keep the key store in memory and would not

18:03.600 --> 18:08.000
try to re-initialize the key store all the time. It's kind of a hack but that was

18:08.000 --> 18:13.120
the best we could do for now. Another thing is that there are different versioning

18:13.120 --> 18:20.160
schemes for APK signatures. And the V1 scheme requires the HSM to basically go through all files

18:20.240 --> 18:24.160
and do individual signatures for all the files in an APK, and that can be hundreds or thousands of files,

18:24.720 --> 18:29.840
and this is way too slow. So the newer signature schemes allow us to just make one signature on the

18:29.840 --> 18:38.160
entire APK, invoke apksigner just once, and that's way faster. So now our last topic is auditing,

18:40.400 --> 18:42.400
which Isha will talk about.

18:43.040 --> 18:52.480
All right, so far we've talked about how we are having backups, how we can ensure that there

18:52.480 --> 18:59.120
is no way to export the keys, but security is not just about making sure that we know what's happening,

18:59.120 --> 19:05.600
it's about making sure that we can see it as well, that there's proof of it. So far, again, we've

19:05.600 --> 19:10.640
kept the key safe, secure, and private, but moving forward, how do we ensure that it is safe,

19:10.640 --> 19:16.480
secure, and private, and stays that way, that there are no unauthorized access operations or

19:16.480 --> 19:21.200
signing operations that have occurred. And coming into this again, we did end up hitting

19:21.200 --> 19:29.120
some limitations with the YubiHSM 2. The first was related to the space available. So this

19:29.120 --> 19:34.880
time it's not about the signing key space or the private key space; it's about the audit log space.

19:34.880 --> 19:39.840
So there is a circular buffer that's available that will be overwritten if you run out of space.

19:39.840 --> 19:46.640
And so our solution here was, one, to enable a property on the HSM such that if you

19:46.640 --> 19:52.000
are not able to write any logs, then you are not able to do any operation. So that's a complete

19:52.000 --> 20:00.400
denial of service. The second part here was to have effectively an interceptor running that would

20:00.400 --> 20:06.560
follow every signing operation and immediately write the log

20:07.520 --> 20:14.000
to the computer afterwards. The second issue here was the authenticity of the logs themselves.

20:14.000 --> 20:20.240
So the builds and everything that is signed is signed by the HSM itself. However, the logs are not

20:20.240 --> 20:29.040
cryptographically signed. And so here, even though we have the provisioning ceremony logs

20:29.040 --> 20:34.480
from day zero, we're not able to chain them unless we went ahead and created this git repo,

20:34.480 --> 20:40.960
an append-only git repo. And so the idea here is that you have automated log verification

20:40.960 --> 20:47.680
using this built in CI. And so combining these two for the most part, recognizing here again,

20:47.680 --> 20:53.600
it's not perfect, but you have mostly self-sufficient and self-expository logs.

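NOTE
Editor's sketch, not from the talk: one way an append-only log can be made
tamper-evident, so that a CI job can re-verify it on every commit. The record
layout, the GENESIS value and the function names are assumptions for
illustration; this is not the YubiHSM log format or the project's tooling.
```python
import hashlib
import json
GENESIS = "0" * 64  # hypothetical anchor value for the first entry
def entry_digest(prev_digest: str, entry: dict) -> str:
    """Digest of one entry, bound to the digest of the entry before it."""
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(prev_digest.encode() + payload).hexdigest()
def append(log: list[dict], entry: dict) -> None:
    """Add an entry whose digest chains onto the current end of the log."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"entry": entry, "digest": entry_digest(prev, entry)})
def verify(log: list[dict]) -> bool:
    """Recompute the chain; an edited, reordered or removed middle entry breaks it."""
    prev = GENESIS
    for record in log:
        if record["digest"] != entry_digest(prev, record["entry"]):
            return False
        prev = record["digest"]
    return True
log: list[dict] = []
append(log, {"seq": 1, "op": "generate-key", "who": "admin"})
append(log, {"seq": 2, "op": "sign", "who": "ci"})
assert verify(log)
log[0]["entry"]["who"] = "someone-else"  # tampering with history
assert not verify(log)
```
A chain like this does not notice entries cut off at the very end, which is
one reason to also keep the latest digest somewhere append-only.
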
20:56.160 --> 21:02.160
I think that is the conclusion of our presentation. It's kind of a narrative of how we went through

21:03.120 --> 21:10.720
figuring out, defining, and building this HSM signing solution. And recognizing that this is not

21:10.720 --> 21:16.480
a perfect solution, but it meets a lot, or most, of our criteria. And moving onwards, I'll

21:16.480 --> 21:17.840
open it up to questions.

21:32.480 --> 21:38.320
It's a good question to follow. Great overview of HSMs; I'm impressed with what you're doing as well.

21:39.360 --> 21:45.600
Why did you guys not go with Nitrokey, but with that proprietary option, rather than more open source things

21:45.600 --> 21:53.280
like the RM or you go? Yeah, at the time we weren't able. Why did we go with Yubi? I'm going to summarize

21:53.280 --> 22:00.000
the question. So why did you go with the YubiHSM 2, which was more proprietary, as opposed to the

22:00.000 --> 22:04.800
Nitrokey? At the time we just weren't able to get our hands on the Nitrokey.

22:09.520 --> 22:14.080
Yeah, the Nitrokey HSM is cheaper, but unfortunately we couldn't get it. And when I looked

22:14.080 --> 22:19.280
into it, I also didn't see any audit capabilities. Like maybe you can tell me, does it have one?

22:24.640 --> 22:29.040
Yeah, but it doesn't have built-in auditing like the YubiHSM has, I think. So that is also another

22:29.120 --> 22:31.120
reason why we didn't go for that.

22:32.000 --> 22:43.920
So specifically, other ROMs: do other ROMs also sign the ROM with the YubiHSM?

22:45.200 --> 22:53.520
So why are you running into the PKCS11 bugs with signing only now, and how

22:53.520 --> 23:01.680
have other projects handled it? So the question is, are there other ROMs doing this as well

23:01.680 --> 23:07.920
with PKCS11? Why do we run into these issues and other people not? The thing is, we're actually

23:07.920 --> 23:13.120
not aware of any other Android ROM that is using HSMs. Like, do you know? At least

23:13.120 --> 23:18.800
publicly, like we even heard like some stories from OEMs that also just have plain key files

23:19.680 --> 23:21.680
for signing. So I don't know, do you know any?

23:27.440 --> 23:33.360
So you thought that your phone was using HSM, but to my knowledge, even they still are not using

23:33.360 --> 23:39.120
HSMs to sign the build. So they also have files still. So to our knowledge, this is actually a

23:39.120 --> 23:44.160
first in the Android ROM ecosystem. So we kind of pioneered it in the space and that's why we

23:44.240 --> 23:49.840
had to solve this problem initially. But we hope that other projects will also pick this up

23:49.840 --> 23:53.920
and that we can together come to better and more sustainable long-term solutions for that.

24:00.800 --> 24:01.920
One more question there?

24:02.560 --> 24:08.800
Is there only a single key involved when signing, and is there any way that two

24:08.800 --> 24:16.000
HSMs could be required to sign something? Is there only a single key involved or is there

24:16.000 --> 24:22.240
a way where two HSMs can be involved in signing something? So there are several keys involved in signing

24:22.240 --> 24:28.880
a build. There are several core system component keys, where each system component uses a different key;

24:28.880 --> 24:33.520
then there's the AVB device key, and if we wanted we could have hundreds of keys for one

24:33.520 --> 24:38.240
build, but we try to keep it as minimal as possible. And the question about the second HSM,

24:38.240 --> 24:44.560
I'm not sure I understood. Like, for parallelism, so that it is faster? Or for a single component, so that

24:44.560 --> 24:51.680
two HSMs are required, for example, that the CI's signature is required and a particular procedure,

24:51.680 --> 25:02.560
for example. Yeah, that's an interesting idea, to use two HSMs for that, but we haven't explored

25:02.560 --> 25:08.080
that yet. We tried to get the basics ready and solve all the problems and then iterate from

25:08.080 --> 25:15.440
there on and improve the solution. Time's up. Thank you very much.

