WEBVTT

00:00.000 --> 00:11.160
Okay, welcome everybody to this talk. Today we are going to talk about error recovery

00:11.160 --> 00:14.560
with MicroOS and systemd-bless-boot.

00:14.560 --> 00:21.320
So first of all, today we are going to talk about errors, bugs, warnings, and I know that

00:21.320 --> 00:27.520
almost all of us are going to encounter bugs and this kind of thing every day.

00:27.520 --> 00:33.400
So we are going to see how we can recover from errors in our distribution.

00:33.400 --> 00:40.640
So I am Danilo Spinella, currently working at SUSE as a research engineer, and we are going

00:40.640 --> 00:45.760
to talk first of all about health-checker, then about the Boot Loader Specification, and at

00:45.760 --> 00:49.160
the end we are going to talk about Automatic Boot Assessment.

00:49.160 --> 00:54.440
So to talk about health-checker, we first need to talk about openSUSE MicroOS. How many of

00:54.440 --> 01:00.680
you know openSUSE MicroOS? Okay, that's a lot of people, nice.

01:00.680 --> 01:06.760
So openSUSE MicroOS is the openSUSE version of SUSE Linux Enterprise

01:06.760 --> 01:07.760
Micro.

01:07.760 --> 01:15.040
It's an immutable system, it uses Btrfs snapshots and transactional updates, and since it's

01:15.040 --> 01:20.680
immutable, all applications are installed in containers, which could be Docker or Podman containers

01:20.680 --> 01:26.640
or Flatpak containers, and it also tries to be as reliable as possible

01:26.640 --> 01:29.560
by leveraging snapshots.

01:29.560 --> 01:34.460
So this is an important thing to talk about because we are going to use snapshots

01:34.460 --> 01:40.840
to recover from errors. Every time we change the system, maybe we update all the packages,

01:40.840 --> 01:46.920
we install or remove something, we need to create a new snapshot. And how do we boot a

01:46.920 --> 01:54.480
different snapshot, how do we select a snapshot? We can use this simple rootflags kernel flag,

01:54.480 --> 01:58.800
we can say which snapshot we are going to use, we just change the version number. In this

01:58.800 --> 02:05.100
case it is the first one, but we can choose whichever snapshot we want. And the key software,

02:05.100 --> 02:11.080
the key component that handles all these snapshot updates, the package updates, is

02:11.120 --> 02:16.800
transactional-update. It also handles Btrfs snapshots. Today we also have the transactional-update

02:16.800 --> 02:22.960
maintainer here, Ignaz Forster. So let's talk about health-checker. health-checker

02:22.960 --> 02:28.480
is a key component, it comes installed by default in MicroOS. First of all, what it does

02:28.480 --> 02:35.080
is check the status of the boot, which means it checks different components. For every component

02:35.080 --> 02:41.200
there is a different plugin. This also means that administrators can extend

02:41.200 --> 02:45.200
what it means to check the boot. Maybe they have some different components, they have

02:45.200 --> 02:49.680
some different services they want to check; with health-checker they can do that. And most

02:49.680 --> 02:55.480
importantly it has a fail-safe mechanism, so when there is an error, health-checker tries to

02:55.480 --> 03:01.880
reboot and tries to bring the system to a working snapshot. Let's talk more about

03:02.120 --> 03:08.960
how the fail-safe works. So in this case, for new entries, if we have a new boot entry, let's

03:08.960 --> 03:15.240
say we just updated the system, we try booting that. If there is any issue with that, any

03:15.240 --> 03:22.120
failure, then health-checker automatically reboots up to three times, and if it fails

03:22.120 --> 03:28.920
then it goes to another snapshot, trying to find a snapshot that works. Of course,

03:28.920 --> 03:33.320
at the time we create a snapshot, the new snapshot is the default one, and in this case,

03:33.320 --> 03:41.400
if it is not working, we want to update that and bring the default entry to a working

03:41.400 --> 03:47.720
snapshot. The last case is: okay, we have a snapshot that is working, we have a version of the

03:47.720 --> 03:52.600
system that is working, but somehow we reboot and it is not working anymore. In that case

03:52.600 --> 03:58.480
what happens is that health-checker will reboot one time into the same snapshot, the same version

03:58.480 --> 04:04.960
of the system, and if it is still not working, it is going to open an emergency shell.

04:04.960 --> 04:10.160
Now let's talk about the Boot Loader Specification, which is usually abbreviated as BLS. How

04:10.160 --> 04:16.440
many of you know the Boot Loader Specification? Okay, that's good. Let's see what it is.

04:16.440 --> 04:23.160
It's something very, very new, but it's getting very quick adoption. So this is a Linux

04:23.400 --> 04:27.720
system, this is a basic system: we have the EFI system partition,

04:27.720 --> 04:33.560
the home partition, the system partition, very basic, and then we might have unallocated space.

04:33.560 --> 04:40.760
This is the bare minimum, no encryption, nothing. Let's say that we want to add another system,

04:40.760 --> 04:48.280
we want multi-boot. How do we do that? So we need to consider a lot of things.

04:48.360 --> 04:54.520
First of all the boot EFI partition: this type of partition needs to be unique on each disk,

04:54.520 --> 05:00.200
so we need to be aware of that. The other system maybe will install a different bootloader,

05:01.000 --> 05:05.560
okay, but which bootloader will be the default one, that of the first system or the second system?

05:07.080 --> 05:14.040
Also, the bootloader we choose needs to be aware of and be able to read the boot partition

05:14.040 --> 05:18.360
of the other system, otherwise it cannot create the entries for both systems, and

05:18.360 --> 05:23.240
we will only have the entries for the last system installed, and that's not something we want.

05:24.440 --> 05:30.200
It gets complicated very fast if we add encryption and systems on different partitions and

05:30.200 --> 05:37.560
different disks. So for this reason the Boot Loader Specification has been created, and it allows

05:37.720 --> 05:45.240
multi-boot systems to share the entries. How does it do that? We have configuration files

05:45.240 --> 05:50.840
that are standardized. We have two different types of configuration files and we put them inside

05:50.840 --> 05:57.480
the /loader/entries path, relative to the EFI partition. So inside this folder

05:57.480 --> 06:03.240
we will have all the entries of all the installed systems that we want the bootloader to find. This also

06:03.240 --> 06:10.360
is very good because we have a standardized configuration: we can have all the applications,

06:10.360 --> 06:19.240
userspace applications, firmware, bootloaders read the same files; we have a common interface that we can

06:19.240 --> 06:27.080
use. Let's look at the first type of configuration, which is the most common; we are not going to

06:27.080 --> 06:35.320
see the second one. This is a simple key-value text file. You can see in the name we have

06:35.320 --> 06:41.560
the distribution, the kernel version, and also, since this is of course a MicroOS boot entry,

06:41.560 --> 06:46.760
we also have the snapshot, which is the last number you can see before the suffix. So what do we

06:46.760 --> 06:52.280
have here? We have the title, we have the version of the kernel and the snapshot,

06:52.920 --> 06:59.880
then we have the kernel arguments and the paths to the kernel and the initramfs. In this case

06:59.880 --> 07:06.440
these paths are relative to the EFI partition. So very straightforward, very simple.
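
NOTE
Editor's sketch, not shown in the talk: how a tool might read the key-value
entry file just described. The keys (title, version, options, linux, initrd)
are the ones the speaker lists; the sample values and paths are hypothetical.
```python
def parse_bls_entry(text):
    """Parse the key/value lines of a Type #1 entry into a dict of lists."""
    entry = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        parts = line.split(None, 1)  # key, then the rest of the line
        value = parts[1].strip() if len(parts) > 1 else ""
        # Keys such as "initrd" may repeat, so every key maps to a list.
        entry.setdefault(parts[0], []).append(value)
    return entry
# Hypothetical entry: title, version, and paths are made up for illustration.
SAMPLE = """\
title   openSUSE MicroOS
version 3@6.12.0-1-default
options quiet
linux   /opensuse-microos/6.12.0-1-default/linux
initrd  /opensuse-microos/6.12.0-1-default/initrd
"""
entry = parse_bls_entry(SAMPLE)
print(entry["title"][0], "uses kernel image", entry["linux"][0])
```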

07:08.360 --> 07:13.560
And the Automatic Boot Assessment relies on the Boot Loader Specification. It's also part of the

07:13.560 --> 07:21.080
Boot Loader Specification and maintained by systemd. With the Automatic Boot Assessment we can revert

07:22.280 --> 07:28.040
the system to a previous version, we can revert the kernel, maybe, to a previous version.

07:28.040 --> 07:35.720
It has several components that allow us to revert the system and, as I said, it relies on

07:35.720 --> 07:42.920
the Boot Loader Specification. So let's see the first part of the Automatic Boot Assessment that we need.

07:42.920 --> 07:48.840
First of all we need boot counting. What is boot counting? It is the number of boot attempts

07:48.840 --> 07:58.600
that we want to try for every new entry. How do we add that? We add a small suffix to the file

07:58.600 --> 08:05.240
name of every entry. So we add a new entry, a new version, and okay, we say we want to try three times.

08:08.200 --> 08:13.240
Since we have not booted yet, we can say that in this case we have three tries left.

08:13.960 --> 08:22.280
That entry is called indeterminate. We try booting; if the boot is successful, in that case

08:22.280 --> 08:28.200
the boot counter is removed and we can say that this entry is considered good, we know that it works.

08:29.800 --> 08:36.040
What happens if it fails? We try booting again and again, and every time we decrease the counter,

08:36.040 --> 08:40.440
so we have three, then two, then we go to zero. In that case the entry is considered bad.
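
NOTE
Editor's sketch, not shown in the talk: the boot counting states described
above, modelled on the file name. It assumes a counter of the form
+LEFT or +LEFT-DONE placed before the extension, as the speaker describes;
the function names and entry names are hypothetical.
```python
import re
COUNTER = re.compile(r"^(?P<stem>.+?)\+(?P<left>\d+)(?:-(?P<done>\d+))?(?P<ext>\.[^.+]+)$")
def state(filename):
    """Classify an entry as good, bad, or indeterminate from its name."""
    m = COUNTER.match(filename)
    if m is None:
        return "good"  # no counter: the entry booted fine at least once
    return "bad" if int(m["left"]) == 0 else "indeterminate"
def count_attempt(filename):
    """Return the new name after one more boot attempt."""
    m = COUNTER.match(filename)
    if m is None or int(m["left"]) == 0:
        return filename  # nothing to count down
    left, done = int(m["left"]), int(m["done"] or 0)
    return f"{m['stem']}+{left - 1}-{done + 1}{m['ext']}"
def bless(filename):
    """Return the name with the counter dropped, marking the entry good."""
    m = COUNTER.match(filename)
    return filename if m is None else m["stem"] + m["ext"]
name = "example-entry+3.conf"  # hypothetical new entry with three tries
while state(name) == "indeterminate":
    name = count_attempt(name)
print(name, state(name))
```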

08:43.560 --> 08:48.920
And here we can see an example: the same entry, with boot counting enabled. In this case we have two

08:48.920 --> 08:54.920
tries left and one already done. As you can see it's very straightforward: inside the file name

08:54.920 --> 09:03.240
we have the plus sign, and afterwards we have the tries left and optionally the number of tries we have

09:03.240 --> 09:08.040
done. This is very simple because this also needs to be handled by the bootloader, so we want it to be

09:08.040 --> 09:15.480
as simple as possible, and renaming the file is very, very straightforward. So what does it mean for

09:15.480 --> 09:21.480
installing new entries? The number of boot tries that we want to do

09:21.480 --> 09:28.840
can be found inside /etc/kernel/tries, and usually when we want to

09:28.840 --> 09:35.240
install a new kernel, a new version, we call kernel-install, and kernel-install is aware of the boot

09:35.560 --> 09:42.760
counter, so it reads the file and creates a new entry with boot counting enabled.

09:44.280 --> 09:49.640
The same goes for sdbootutil. sdbootutil is a utility that can be found on Tumbleweed and

09:49.640 --> 09:59.640
MicroOS, and most of the time it manipulates the entries. Sometimes maybe sdbootutil recreates the

09:59.720 --> 10:07.880
entries; in that case, it also needs to reset the boot counter, and sdbootutil is aware of boot

10:07.880 --> 10:15.160
counting. So these are the first two components that create the entries and are aware of boot

10:15.160 --> 10:24.600
counting. Right after that we need the support of the bootloader. So inside the bootloader

10:25.560 --> 10:30.600
we will have all the entries, which will be read from the /loader/entries folder

10:30.600 --> 10:37.400
that we have seen before, and on top of the usual sorting depending on the name of the distribution

10:37.400 --> 10:42.680
and the version, we will have one other type of sorting, which is the sorting depending on the

10:42.680 --> 10:49.640
boot counter. So any entries that have zero tries left will be put at the bottom:

10:49.640 --> 10:55.560
we know that the entry is not working, so the bootloader automatically sorts it last.
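
NOTE
Editor's sketch, not shown in the talk: the extra sorting step described
above. It assumes the list is already in the bootloader's usual order and
only moves entries with zero tries left to the end; real bootloaders may
implement this differently. Function names and entry names are hypothetical.
```python
import re
def tries_left(filename):
    """Return the tries-left counter from the file name, or None if absent."""
    m = re.search(r"\+(\d+)(?:-\d+)?\.conf$", filename)
    return int(m.group(1)) if m else None
def menu_order(filenames):
    """Keep the given order, but push exhausted entries to the bottom."""
    # sorted() is stable, so entries keep their relative order within each
    # group; False (still usable) sorts before True (zero tries left).
    return sorted(filenames, key=lambda name: tries_left(name) == 0)
entries = ["snapshot-5+0-3.conf", "snapshot-4+2-1.conf", "snapshot-3.conf"]
print(menu_order(entries))
```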

10:57.720 --> 11:04.920
And another thing that is very important is that every time the bootloader boots an entry,

11:04.920 --> 11:11.480
so we choose an entry or maybe the default one will be booted, the bootloader needs to

11:11.480 --> 11:18.520
read the file name, check the boot counter and rename the file; it needs to adjust the counter.

11:18.520 --> 11:24.440
So we have three tries left, okay, we just did one, now we have two; we did one, so the second

11:24.440 --> 11:31.720
counter is increased. Regarding support, systemd-boot has complete support for the Boot Loader

11:31.720 --> 11:40.120
Specification and Automatic Boot Assessment. GRUB 2 has partial

11:40.120 --> 11:46.200
support; we are going to talk more about that later. But the good

11:46.200 --> 11:53.400
thing is that any bootloader can support Automatic Boot Assessment regardless: it just needs to

11:53.400 --> 12:02.680
be aware of the boot counter and needs to rename the files. So let's talk a little bit more about

12:02.840 --> 12:11.480
Boot Loader Specification support. There were two different sets of patches, one made by Fedora

12:11.480 --> 12:18.920
and one made by openSUSE, that added the support. GRUB 2.14 just came out recently,

12:18.920 --> 12:25.160
I think it was last week, and now we have upstream support for BLS, but it's still incomplete:

12:25.880 --> 12:33.720
boot counting is not supported upstream in 2.14 and is still supported downstream by openSUSE,

12:33.720 --> 12:45.000
but the patches will be upstreamed very soon. So we saw the boot counting part; now let's talk

12:45.000 --> 12:51.000
about the boot itself, so how we assess that the system is booting. If we want to have a service

12:51.080 --> 12:58.440
that checks if the system has booted successfully, like health-checker in this case, but there are also

12:58.440 --> 13:03.480
other services provided by default by systemd, there is for example a service that says

13:03.480 --> 13:09.960
we don't want any service to fail: if any service fails, then in that case the boot was a failure,

13:10.760 --> 13:16.040
we just need to add RequiredBy=boot-complete.target to our service unit.

13:17.000 --> 13:24.600
boot-complete.target is a special target unit made by systemd. And vice versa,

13:24.600 --> 13:30.680
if you want to have some service that acts upon a successful boot, like systemd-bless-boot,

13:31.720 --> 13:37.240
we can just add Requires=boot-complete.target to our service. So

13:37.240 --> 13:43.560
systemd-bless-boot, if you actually go read the service, just has these two lines: it needs to be executed

13:43.560 --> 13:50.200
after the boot is successful, and it also needs to require boot-complete.target.

13:51.960 --> 13:56.200
So let's talk a little bit about systemd-bless-boot. This is a small utility provided by systemd.

13:57.000 --> 14:03.000
First of all, it is executed by default when the boot is successful,

14:03.000 --> 14:07.880
and it just removes the boot counter in that case. We can also use it to get the status of the

14:07.880 --> 14:13.880
current entry: we just call it, and we are going to see if the entry is indeterminate,

14:13.880 --> 14:21.240
bad, or good. This is used by health-checker, actually. And last thing, we can also change

14:21.240 --> 14:25.880
the status of a boot entry: we can say, okay, this entry, we consider it bad,

14:25.880 --> 14:28.840
we can just call systemd-bless-boot and it will do the renaming for us.

14:31.320 --> 14:36.760
Why are we adding health-checker on top of Automatic Boot Assessment, why not use it directly?

14:37.880 --> 14:42.120
There are a few issues with Automatic Boot Assessment. First of all, it is very simple by

14:42.120 --> 14:49.560
design, because it is implemented by bootloaders; the bootloaders are already complex

14:49.560 --> 14:58.440
as they are, and we want to have something as simple as possible. That means that we don't have any

14:58.600 --> 15:05.400
automatic reboot on failure. The default entries are not updated: if we set an entry as the default

15:05.400 --> 15:15.400
and the entry is bad according to the systemd-bless-boot status, we will still boot it again.

15:17.000 --> 15:21.080
The bootloaders consider any default entry, regardless of the boot counting or the boot status,

15:21.080 --> 15:25.480
as the default, and they will boot it anyway. This is not something that we want,

15:25.480 --> 15:30.440
so we need something on top to add logic and to provide a better fail-safe mechanism.

15:32.440 --> 15:38.920
So let's go and see again how the health-checker fail-safe works, now that we have seen the bootloader

15:38.920 --> 15:45.320
and the tools provided by systemd. So first of all, to check the boot entry status,

15:45.320 --> 15:52.040
we use systemd-bless-boot. We check, we update and remove the default entry if we need to,

15:52.440 --> 16:02.440
and when boot counting is enabled for an entry, then we know that the entry is new,

16:02.440 --> 16:15.480
we leave it to the sorting of the bootloader to choose the next entry, and health-checker reboots on failure.

16:16.440 --> 16:21.560
Now, if the entry was working before, we know that it was working, it still opens an emergency shell.

16:21.560 --> 16:27.560
So, that was the presentation, thank you for your attention, so we can go to the questions.

16:27.560 --> 16:55.640
So the question is if the patches, the upstream implementation of BLS support, are actually the patches from

16:55.720 --> 17:00.120
openSUSE or Fedora. I think it's a clean implementation made by the GRUB developers,

17:02.440 --> 17:05.480
but yeah. So please.

17:09.880 --> 17:15.880
So, against tampering. So the question is how safe BLS is against tampering. So in that case,

17:16.200 --> 17:28.200
using the TPM, we can work against tampering, so yes. So this actually is implemented by

17:28.200 --> 17:33.320
openSUSE MicroOS; by using sdbootutil, we try to work against tampering, yeah.

17:41.560 --> 17:44.680
So the question is if this is related to systemd, to what systemd does.

17:45.240 --> 17:50.760
Yes, we use systemd tools to work with the TPM. Thank you. Please.

17:57.000 --> 18:02.360
So the question is if health-checker works just with UEFI or if it works without. So

18:03.720 --> 18:10.840
the Automatic Boot Assessment could work without UEFI. There is one other thing which we didn't

18:10.840 --> 18:17.560
talk about in this presentation, which is the Boot Loader Interface. The Boot Loader Interface contains

18:17.560 --> 18:23.160
some metadata and data to be shared between systemd and the bootloader, and that works

18:23.160 --> 18:29.080
only with UEFI. Automatic Boot Assessment relies a little bit on the Boot Loader Interface, so

18:29.080 --> 18:34.760
currently it doesn't work without UEFI, but this support could be added in the future. Thank you.

18:36.680 --> 18:38.040
Any other questions, please?

18:41.800 --> 18:45.480
Yes, this one, yeah.

18:45.480 --> 18:50.920
If you follow anything, any other issue, and there's something we can have about changing this,

18:50.920 --> 18:55.800
I'm not very likely to change this, because what does default mean? Sure, it overrides

18:55.800 --> 19:03.400
the assessment, definitely, but I think we can add another concept of something that is a little

19:03.400 --> 19:08.920
bit less powerful, that just basically says, yeah, pick this kind of item by default,

19:10.040 --> 19:15.720
unless they are changing the assessments of the account, so I will not come like, tell the end,

19:15.720 --> 19:19.800
tell the rules, you just say, but take to the bed, let's fix that.

19:19.800 --> 19:29.240
So the question is regarding the default entry, and whether systemd upstream is going to change that.

19:29.880 --> 19:35.560
Yes, I read the discussions; they are still a little bit on the fence about changing

19:36.280 --> 19:40.520
this mechanism. Hopefully they do, so that we can remove this from health-checker as well.

19:41.560 --> 19:45.800
And then the other thing, no automatic boot is available, you can do this here, if you like,

19:45.800 --> 19:51.080
we just use this open to fill with value, right? Like, if the boot success stuff kind of

19:51.080 --> 19:56.520
be reached, then some action should be taken obviously, but the processing downstream, this is like

19:56.840 --> 20:00.600
the choice for you, like, what's right, please, okay, but so there is a thing like, for example,

20:00.600 --> 20:05.720
you can set a timeout until this point is reached, and that's typically what you want,

20:05.720 --> 20:13.960
or you can have an action that you assign to it, like, when this thing is never reached and

20:13.960 --> 20:18.680
things like this. So there's actually already this; maybe we can get better documentation

20:18.680 --> 20:23.560
to tell you exactly how this is about each amendment, any kind of additional infrastructure to

20:24.520 --> 20:29.000
just make sure to make use of the stuff that's already then configured to the policies that you want.

20:30.040 --> 20:36.440
So, systemd actually provides a way to reboot automatically, for example,

20:36.440 --> 20:41.480
by setting a timeout, after a couple of times that the boot didn't finish.

20:42.120 --> 20:47.960
So, yes, we are aware of that, but for us, we already had health-checker before,

20:48.600 --> 20:55.320
so we already knew what a successful boot was for us, and we still use the same mechanism

20:55.320 --> 21:00.600
to reboot if there is an issue. So, this is other logic on top; yeah, we could also do that.

21:00.600 --> 21:05.000
I mean, I'm very curious about what kind of logic we don't cover yet;

21:05.000 --> 21:10.200
if it's generally useful logic, then I would like to know about it.

21:10.200 --> 21:12.680
Okay, we can talk later about this. Thank you.

21:13.480 --> 21:14.040
Yes, please.

21:17.960 --> 21:47.720
So, the discussion was about adding more documentation,

21:48.360 --> 21:52.120
I'll confirm, okay, we find these things. We don't, we think this is indeed because they are

21:52.120 --> 21:55.960
already implemented. Thank you. Any more questions? Yeah, please.

21:57.000 --> 22:04.120
You mentioned that there are two sets of patches submitted to GRUB, from Fedora and

22:04.120 --> 22:12.440
from openSUSE. Do you have different approaches in that regard, and do you talk

22:12.440 --> 22:20.520
to each other or maybe come up with a solution together? So, the question is regarding the

22:20.520 --> 22:26.200
two different sets of patches from Fedora and openSUSE. So, unfortunately, I'm not aware

22:26.200 --> 22:33.480
of the original work. I don't know if they talked to the GRUB developers

22:33.480 --> 22:38.360
and found a common ground. Unfortunately, I don't know how it went,

22:39.000 --> 22:45.640
but currently the implementation in GRUB is pretty close. There is still a little bit

22:45.640 --> 22:51.080
missing, but we're going to upstream that now that we have a baseline in common. Thank you.

22:56.360 --> 23:01.480
Okay, are there cases where you need to add some sort of hardware watchdog,

23:01.480 --> 23:06.280
if you have issues with the kernel, where the kernel doesn't get to the point of starting

23:06.280 --> 23:12.840
init? So, the question is regarding having a watchdog, if the kernel is not starting. So in that

23:12.840 --> 23:21.560
case, the bootloader is still going to decrease the boot counter. And we have a

23:21.560 --> 23:27.240
reboot on kernel panic. So, if the kernel is panicking, it reboots, and the counter is going

23:27.320 --> 23:35.160
to be decreased. So the system reboots and, at one point, will choose another snapshot. Thank you.

23:36.920 --> 23:40.680
Please. This is inspired by the same, because the, uh, the, not the network must do.

23:41.480 --> 23:47.240
Uh, sorry, can you please...

23:47.560 --> 23:55.640
uh, uh, unfortunately I, I, uh, I don't know. We got to think just the same thing. Okay. Yeah,

23:56.200 --> 24:00.680
yes. I don't know if you have the recovery should be pretty similar, so, yes, but I, I don't know

24:00.680 --> 24:06.200
actually, I do and so on, so that. Thank you.

24:06.200 --> 24:10.200
Okay, so no more questions.

24:10.200 --> 24:12.200
Thank you everybody.

24:12.200 --> 24:19.200
Thank you.

