WEBVTT

00:00.000 --> 00:21.240
Okay. So, hello everyone. I'm Gábor Boskovits. I'm going to talk about name resolution

00:21.240 --> 00:29.280
in package managers from a reproducibility perspective. This is going to be a rather practical

00:29.280 --> 00:39.480
and like side-by-side comparison of three prominent package managers and it's not going

00:39.480 --> 00:49.240
to be so theoretical. I'll leave this to the next presenter and yeah, just a few words about

00:49.240 --> 00:58.000
me. So, as a day job I work at Roche. In the Roche Linux team, we are providing the

00:58.000 --> 01:06.300
Debian-based distribution for medical devices. I have also worked as a Guix developer for roughly

01:06.300 --> 01:14.600
a decade now, which is a functional package manager. And yeah, as a hobby I like astronomy and

01:14.600 --> 01:24.640
I like sailing. So, let's get into it and see what we have. So, let's first have like a few

01:24.640 --> 01:32.000
definitions cleared up. How I'm going to handle packages in this talk is that we are thinking

01:32.000 --> 01:40.160
about them as a triplet, the triplet of a package name, a package version and the actual data

01:40.160 --> 01:48.320
that is comprising this package. This is not always like 100% true. For example, you might

01:48.320 --> 01:54.920
end up in the unfortunate situation that the same name and version is actually designating

01:54.920 --> 02:06.600
multiple different packages. This is not so good. Yeah, then what is reproducibility? Reproducibility

02:06.600 --> 02:13.920
is a property that when you are like building the same source then you get the same binary.

02:14.000 --> 02:21.120
This is a pretty straightforward thing that we would like to achieve. It is not always working

02:21.120 --> 02:29.520
like this. But why is this important? Why do we want to have this? That's because if you

02:29.520 --> 02:37.120
think about the complexity of contemporary software, just by having a look at its binary

02:37.120 --> 02:43.640
representation, it is very hard to come to any conclusion about what it actually does.

02:44.200 --> 02:50.680
That's why we prefer to work with them in source form. But then we'd like to

02:50.680 --> 02:57.000
establish a correspondence that, okay, this binary is really corresponding to this source code.

02:58.360 --> 03:03.400
But now with the current tooling that we have at our disposal, the only way to

03:04.120 --> 03:09.960
actually do this is to rebuild it from source and then see that we are coming up with the same result.
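
NOTE
A minimal sketch of the rebuild-and-compare check described above, assuming a
build can be modelled as a function from source bytes to binary bytes. The
names digest, is_reproducible, stable_build and leaky_build are hypothetical.
```python
import hashlib
import itertools
def digest(artifact: bytes) -> str:
    """Hash a build output so two builds can be compared bit for bit."""
    return hashlib.sha256(artifact).hexdigest()
def is_reproducible(build, source: bytes, runs: int = 2) -> bool:
    """Build the same source several times and check that all outputs are identical."""
    return len({digest(build(source)) for _ in range(runs)}) == 1
# Hypothetical builds: one deterministic, one leaking a build counter into the output.
build_counter = itertools.count()
def stable_build(source: bytes) -> bytes:
    return source.upper()
def leaky_build(source: bytes) -> bytes:
    return source.upper() + str(next(build_counter)).encode()
stable_ok = is_reproducible(stable_build, b"int main(void) { return 0; }")
leaky_ok = is_reproducible(leaky_build, b"int main(void) { return 0; }")
```
Only the first build passes: anything that varies between runs, such as a
counter or a timestamp, breaks the correspondence between source and binary.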

03:10.600 --> 03:19.880
Right now going backwards does not work. Yeah, it works for like small stuff, but for typical

03:19.880 --> 03:35.240
software it does not fly. So yeah, when we are talking about reproducibility, we are talking

03:35.240 --> 03:44.120
about different layers of reproducibility. We can have like a layer on package level that, okay,

03:44.120 --> 03:50.600
this package is reproducible. If I build it from source then I get the same package. We can also

03:50.600 --> 03:58.920
talk about this at different levels. For example, if we are talking about reproducibility at a system level,

03:59.800 --> 04:07.400
then what we expect is that from the same definition of the operating system, I get the, let's say,

04:08.040 --> 04:17.320
same set of files on the hard drive of the system. And for this to happen, we actually have to pin

04:17.320 --> 04:24.120
the whole transitive dependency graph of the system image. Because if there is like just one version

04:24.120 --> 04:31.880
mismatch anywhere, it is not going to be a reproducible system image anymore. Yeah,

04:36.200 --> 04:42.280
why we have name resolution in packaging systems is that we don't really want to encode

04:42.280 --> 04:48.440
this whole transitive dependency graph with all the versions for every package definition.

04:49.320 --> 04:54.520
Most probably in the first place, we don't have enough information to do this. And then we

04:54.520 --> 05:01.880
would also duplicate like an immense amount of information. So we have to simplify somehow and

05:04.360 --> 05:13.400
be able to express the dependencies in a way that we are referring to like not a complete

05:13.400 --> 05:22.120
triplet of the package. But let's say we are just specifying the name or the name and the

05:22.120 --> 05:27.320
version, or something like a smaller piece of information to identify this package.

05:30.600 --> 05:36.280
We would like to essentially have something like this that we have a package called name

05:36.360 --> 05:48.280
and this depends on this and this and this package. And also because we want to be reproducible,

05:48.280 --> 05:56.360
we want to have this process to be deterministic. In most cases, this is not deterministic.

05:56.360 --> 06:03.560
There are a few approaches that are taken, but usually it involves like creating a candidate set

06:03.640 --> 06:10.920
of packages that we can resolve the dependency to and then somehow selecting one from them.

06:10.920 --> 06:18.200
Usually let's say the latest version. But then this latest version might depend for example on

06:18.200 --> 06:25.960
when the package repository update was run the last time on the system and this kind of things.

06:26.680 --> 06:31.880
And this is not really acceptable if you want to have a reproducible build of the system.
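
NOTE
A minimal sketch of the "build a candidate set, then pick the latest" approach
described above. The function resolve_latest, the package name libfoo and both
snapshots are hypothetical; real resolvers also handle constraints and conflicts.
```python
def resolve_latest(name, repo_index):
    """Pick the highest version of `name` that this repository snapshot knows about."""
    candidates = [(n, v) for (n, v) in repo_index if n == name]
    if not candidates:
        raise LookupError(f"no candidate for {name!r}")
    return max(candidates, key=lambda pkg: pkg[1])
# Two hypothetical snapshots of the same repository, taken at different times.
snapshot_monday = [("libfoo", (1, 2, 0)), ("libfoo", (1, 3, 0))]
snapshot_friday = snapshot_monday + [("libfoo", (1, 4, 0))]
monday_choice = resolve_latest("libfoo", snapshot_monday)
friday_choice = resolve_latest("libfoo", snapshot_friday)
```
The same name resolves to a different version on each snapshot, so the result
depends on when the repository index was last updated, not only on the
package definition.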

06:32.520 --> 06:43.240
So somehow you have to circumvent this thing. And the other thing is that we don't have to

06:44.200 --> 06:50.680
don't want to author all the packages that are like existing in the world because that would be

06:50.680 --> 07:00.120
like super annoying. We want to share them somehow. So we want to have a place where we share

07:00.120 --> 07:06.440
that okay, this software can be packaged like this and from this package definition you can somehow

07:06.440 --> 07:15.400
build it or you have a binary that you can download that is going to correspond to that particular

07:15.400 --> 07:27.160
piece of software and you can install it on your system. Yeah. So we are having like repositories

07:27.240 --> 07:32.680
which is like called in a different way by every package manager I think that exists.

07:33.880 --> 07:41.160
We already had this, that for example in Rust we are calling these, the packages we are calling

07:41.160 --> 07:47.800
crates, and repositories we are calling registries. But these are conceptually the same thing.

07:47.960 --> 07:57.960
Yeah. Okay. So right now what we are going to look at is how we are defining this process.

07:59.160 --> 08:07.480
First we have like a bunch of configuration that is like the active configuration of the package

08:07.480 --> 08:13.880
manager. One piece of this information is which of these repositories are used.

08:14.280 --> 08:25.240
Yeah. So where I am sourcing the package definition from and this is like a common piece of

08:25.240 --> 08:33.320
information. It is done very similarly in each package manager. You have usually one or more

08:33.320 --> 08:42.120
configuration files with information like this. This is an example for the Debian source definition.

08:43.000 --> 08:49.880
Here is an example for Cargo, how you define the registry there. This is an example in

08:49.880 --> 08:59.320
Guix, how you define a channel there. Yeah. And then all of these repositories have some kind of

08:59.320 --> 09:08.040
underlying data structure. So here in Rust Cargo I am pretty sure that there are a bunch of

09:08.120 --> 09:13.800
people who are like much more knowledgeable than I am because Cargo I am using only as a user.

09:14.360 --> 09:20.440
This is roughly what I know about it, that you are pointing to your URL there and that's

09:21.320 --> 09:27.240
most of the information you pass there. In Debian, there I know a bit more about this.

09:27.240 --> 09:34.120
I would say that I am a power user. So here you have this Signed-By

09:34.520 --> 09:43.320
stanza which is basically pointing to a GPG key. And then in a Debian repository you actually have a

09:43.720 --> 09:50.360
Release file which is signed by this key and it is containing an integrity database of the actual

09:50.360 --> 09:59.720
state of the repository. Yeah. And in Guix what we have is actually each of these channels is

09:59.720 --> 10:09.320
a Git repository. And then basically there we are using the security model of the underlying Git

10:09.320 --> 10:17.720
repository. When we are accessing it we have of course the option to have like TLS. And then in the

10:18.520 --> 10:25.880
content of the Git repository we have the Git commit signing and all these additional security

10:26.040 --> 10:41.320
properties that Git gives us. And what is interesting about this is that when you are thinking

10:41.320 --> 10:53.080
about this then the Cargo and Debian registry are like not really having their full history at

10:53.080 --> 11:02.200
your disposal. Yeah. On the other hand what Guix channels do, because they are Git repositories, they

11:02.200 --> 11:10.760
have the full history available. So that means that basically you can time travel backwards and

11:10.760 --> 11:19.240
forward along the channel definition. And then you can have let's say a package from like five years

11:19.240 --> 11:31.400
ago pretty easily pulled in. Yeah. And also what this means there is it's pretty easy to actually

11:31.400 --> 11:40.440
pin down the whole repo state. Because you just note which channels are in use and you write down

11:40.440 --> 11:47.320
like the Git commits that they are at. And then you have a fully reproducible repository

11:47.400 --> 11:55.080
state. And in the other cases you have to actually resort to some kind of externally imported

11:55.080 --> 12:03.880
information to do this. Yeah. And then what does a package definition look like? Because this is also

12:03.880 --> 12:12.680
important from the reproducibility perspective. So let's have a look at how Debian does this. A Debian

12:12.680 --> 12:21.480
package is like defined by quite a few files. And the control file is basically giving you like the

12:21.480 --> 12:32.360
name of the package. And it also tells you its dependencies. Yeah. And then you also have the change

12:32.520 --> 12:41.000
log. The change log is essentially telling you the version of the package. More or less. Yeah.

12:42.680 --> 12:50.360
And then you have basically the rules as the build information, the source format, because Debian

12:50.360 --> 12:58.440
is able to like build the packages in different ways based on the structure of the actual package

12:58.440 --> 13:07.560
definition on the Debian side. And this is roughly building up a package. Yeah. So most of these code examples

13:07.560 --> 13:13.960
are snippets. They are not full definitions. They are just the parts that are like important from

13:14.040 --> 13:33.560
this perspective. So. Yeah. So in Rust you do something like this. This is in a

13:33.560 --> 13:41.960
single file. And then you have the name and the version. And then the dependencies. So this format

13:42.040 --> 13:48.200
is like way more compact. And here in the dependencies you actually have a bit of

13:48.920 --> 13:55.320
leniency. You can say that it's like bigger than this, bigger than that, but lower. Yeah. So you

13:55.320 --> 14:02.600
can have like a specification of the versions that are allowed for the resolution. And in this form

14:02.600 --> 14:13.640
it is actually tying the versions, but you can have like more info here. And then this is how

14:13.640 --> 14:21.960
it looks in Guix. What you see here is that the package name is basically duplicated.

14:22.920 --> 14:31.000
So the name of the package after this name stanza is what you are referring to the package as

14:31.080 --> 14:37.720
on the command line. And the top one is actually defining a variable. And this is how you

14:37.720 --> 14:48.600
are referring to the package in the API. And then in the inputs this name that you have there is corresponding

14:48.600 --> 14:57.080
to these variables. And then the name resolution in Guix is basically the resolution of the variable

14:57.160 --> 15:04.680
binding in the underlying programming language. That means that when you put into this inputs list

15:05.320 --> 15:15.320
glibc, then you already pin the version to the exact one, like a very similar definition to this form.

15:16.520 --> 15:22.760
And this is always like fully pinned. There is literally no way to express anything besides

15:22.920 --> 15:30.120
this binding construct. That means that this structure is completely reproducible from the

15:31.480 --> 15:40.600
state of the repository. So a few observations: the Debian packaging requires

15:42.200 --> 15:49.160
quite a bit of work. I think everyone who ever did this knows that this is not like

15:49.240 --> 15:56.200
a trivial feat. It can be done pretty quickly. But yeah, it can be complicated at least.

15:57.480 --> 16:05.640
Yeah. And with the Guix package, basically, this is what I talked about: the variable binding

16:05.640 --> 16:11.160
resolution is already giving you a full graph of tied dependencies.
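
NOTE
A minimal sketch of resolution by variable binding, written in Python as an
analogy only. Guix package definitions are Scheme code, and the Package class,
the package names and the closure helper here are hypothetical.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class Package:
    name: str
    version: str
    inputs: tuple = ()
# Hypothetical definitions: each variable is bound to one exact package object.
libbar = Package("libbar", "2.0.1")
libfoo = Package("libfoo", "1.3.0", inputs=(libbar,))
app = Package("app", "0.9", inputs=(libfoo, libbar))
def closure(pkg):
    """Return the transitive dependency graph as a sorted list of (name, version)."""
    seen = {(pkg.name, pkg.version)}
    for dep in pkg.inputs:
        seen.update(closure(dep))
    return sorted(seen)
pinned = closure(app)
```
There is no search step: the inputs hold the package objects themselves, so the
whole graph is fixed as soon as the definitions are evaluated.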

16:11.160 --> 16:21.960
Then there is usually a last step, which is missing from Guix, which is that once you have the candidates,

16:21.960 --> 16:32.440
how you actually find out which ones to manifest on the system. So normally in a Debian system

16:33.560 --> 16:40.040
you are manifesting the latest version that is available in all the repositories for the given package.

16:41.560 --> 16:45.960
Except if you are having a conflict or if you are defining pinning.

16:47.640 --> 16:53.880
So that means that if you want to get a reproducible build on a Debian system, then you have

16:53.880 --> 16:58.440
like one of these pin definitions for each and every package on your system.

17:00.360 --> 17:04.920
This is perfectly doable. But it's pretty inconvenient.

17:05.080 --> 17:16.760
Then in Cargo we are creating the lock file. This is like a pretty typical thing, a bunch of

17:16.760 --> 17:22.040
other package managers are doing this thing. That basically when you are building,

17:22.040 --> 17:29.560
it is like resolving the dependencies, drops that into a file for you. And then that file is like

17:29.560 --> 17:33.960
taking precedence for the future builds until you say that you want to update it.

17:35.080 --> 17:39.720
This is giving you like reproducible set of sources to build from.
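
NOTE
A minimal sketch of the lock file mechanism described above, assuming a JSON
file as a stand-in (Cargo's real Cargo.lock is a TOML file with more fields).
The function resolve_with_lock, the package libfoo and the version lists are
hypothetical, and versions are compared as plain strings for brevity.
```python
import json
import os
import tempfile
def resolve_with_lock(requirements, repo_index, lock_path):
    """Reuse pinned versions from the lock file; resolve and record anything not yet pinned."""
    locked = {}
    if os.path.exists(lock_path):
        with open(lock_path) as f:
            locked = json.load(f)
    for name in requirements:
        if name not in locked:
            # Toy policy: take the highest version string currently in the index.
            locked[name] = max(repo_index[name])
    with open(lock_path, "w") as f:
        json.dump(locked, f, sort_keys=True)
    return locked
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
first = resolve_with_lock(["libfoo"], {"libfoo": ["1.2.0", "1.3.0"]}, lock_path)
# The repository gained 1.4.0 in the meantime, but the lock file takes precedence.
second = resolve_with_lock(["libfoo"], {"libfoo": ["1.2.0", "1.3.0", "1.4.0"]}, lock_path)
```
Both calls return the same pinned version, which is why sharing the lock file
lets someone else rebuild from the same set of sources.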

17:46.120 --> 17:50.200
And then in Guix we already resolve basically at definition time.

17:51.000 --> 17:57.400
By this trick that we are actually putting the fully fledged package objects into the graph.

17:57.400 --> 18:13.400
Yeah, then next thing is that when we have these repositories then there are a few

18:13.400 --> 18:21.640
problems to talk about. So sometimes the repository state is affecting your build.

18:21.880 --> 18:32.280
You can work around this in all of the systems. It pretty much depends on the system how convenient it is

18:32.280 --> 18:38.200
to work around this if you want a reproducible build. And then the other thing is that the user

18:38.200 --> 18:43.560
or the client side of the thing is not having control over this state.

18:43.560 --> 18:52.760
Yeah, so if you are in control over this state then there is no problem. But usually this is

18:52.760 --> 18:58.360
not the case. You cannot tell, let's say, the Debian people that, well, please don't publish

18:58.440 --> 19:15.320
the fixes anymore. This just does not fly. Yeah. And then there is a bit of a problem

19:15.320 --> 19:23.800
also with the lock-file-based things. The biggest problem there is that you are not like required

19:23.800 --> 19:31.480
to publish your lock file. And then it is like very hard to figure out, for example,

19:31.480 --> 19:38.760
what your test suite was run against. Yeah, because it gets detached, you cannot like reproduce

19:38.760 --> 19:46.040
the same thing because you are missing this information. So one way to improve here is to

19:46.040 --> 19:57.080
encourage committing the lock file into source control. Then there is a third thing.

19:58.040 --> 20:04.840
This is affecting all the systems. Sometimes the like upstream just goes away.

20:06.840 --> 20:09.720
And then you don't have a way to rebuild it anymore.

20:09.720 --> 20:23.160
Yeah. So what is the benefit that is like provided by having a centrally managed repository?

20:24.040 --> 20:30.440
Is that the repository managers can enforce certain properties. So this is for example where

20:30.440 --> 20:40.840
crates.io can shine. Yeah, because they can say that okay, let's say we are not allowing to remove

20:40.840 --> 20:50.200
stuff. Or we are not allowing to push a different binary corresponding to the name and version

20:50.200 --> 20:59.080
and these things. And if this is like centralized, then this is giving you the possibility to do

20:59.800 --> 21:08.040
this. And then the situation is way better than like in the wild west where everyone is

21:08.040 --> 21:16.280
pushing everything into the repository. Yeah, but this still requires some coordination.

21:18.120 --> 21:26.280
Yeah. So the approaches that can be taken when there is no central management are basically two:

21:27.160 --> 21:31.560
either you take over the repository to your side and define its state completely,

21:33.480 --> 21:39.640
which is one approach that is like typically applied. And then the other one is

21:41.880 --> 21:49.320
that you can add additional configuration information to always resolve to the same name. Let's say

21:49.560 --> 21:57.000
pinning. Yeah, but this is usually like a lot of work. And then it does not really

21:57.000 --> 22:04.840
mitigate against this package removal problem. Yeah, actually taking over the repository does

22:04.840 --> 22:12.600
because there you just say it, you don't remove it and that's it. Yeah. And then in functional

22:12.600 --> 22:20.600
package managers what we have instead is that this is like completely tied to

22:22.520 --> 22:30.040
the repository state. It is always attached to the actual build. So you can have a provenance

22:30.920 --> 22:37.320
file that is like telling you that it was built by having the package manager pointing to

22:37.320 --> 22:48.200
these commits on these channels and that's it. And then we also do in Guix something very

22:48.200 --> 22:55.000
interesting about the removal of the upstreams because there we are still relying on the source at least.

22:56.520 --> 23:05.160
There are like a bunch of systems, for example Software Heritage is allowing you to send the source

23:05.160 --> 23:14.280
code to them and then they are archiving it. And when the upstream URLs are going down,

23:14.280 --> 23:23.720
Software Heritage still has a copy from which you can build. Yeah, and also basically in

23:24.120 --> 23:31.880
these functional package managers the binary packages are just a cache. There is a promise,

23:31.880 --> 23:38.920
let's say, that you can build the exact same thing from source if you want, but we are providing

23:38.920 --> 23:44.680
in our build infrastructure something that we build ourselves. Feel free to check it, otherwise you

23:44.680 --> 23:56.280
can trust our signature and then trust the binaries that we build for you. Yeah, so packaging effort.

23:56.600 --> 24:04.200
So usually for language package managers it's relatively low. They are made to be frictionless with

24:04.200 --> 24:11.640
that particular programming language. Yeah, but there are two things here. One of them is the diamond

24:11.640 --> 24:18.200
problem when you are having like two packages depending on the conflicting versions of something that

24:18.280 --> 24:28.440
is like a single package. This is a tricky one and then it can result in all kinds of interesting situations.

24:29.320 --> 24:33.560
You can avoid this for example by running the whole thing in a container, then they are not

24:33.560 --> 24:42.840
touching each other and then we are happy. Yeah, and then there is the infrastructure problem

24:42.840 --> 24:47.720
with language specific package managers, which means that basically the things that you built

24:48.760 --> 24:54.280
are still depending on the actual versions of the tool chain whatsoever that is like not defined

24:54.280 --> 25:03.640
in the package definition. And yeah, basically here you want to have an isolation.

25:04.440 --> 25:16.680
And then the effort is usually like you create the metadata build instructions and then you

25:16.680 --> 25:24.360
test the package and then you can release it. It's very similar to functional package managers.

25:25.800 --> 25:31.720
And then there is a maintenance effort which goes into like how hard it is to update this.

25:31.720 --> 25:47.800
Yeah, and once we have like all of this, then we are ending up with like some interesting data,

25:47.800 --> 25:55.560
like how well this thing actually works. And what we get is that now we are having like a

25:55.560 --> 26:05.320
95% reproducibility for almost everything, which is really nice. There is one missing point,

26:05.720 --> 26:11.480
which is really bad as of now, is that most of the time we are going to build from release

26:11.480 --> 26:17.000
tarballs on the distribution level. And then the creation of the release tarball from the sources

26:17.000 --> 26:23.960
is usually not reproducible. This is something that we want to address in the near future.

26:24.920 --> 26:32.280
And then my key point about the whole thing is if you make this simple for everyone,

26:33.240 --> 26:40.920
then everyone is willing to have reproducibility. So let's go here and try to make reproducibility

26:41.000 --> 26:52.240
easy. Any questions?

26:52.240 --> 27:18.240
So it doesn't seem to be the case, thank you very much.

27:18.240 --> 27:28.240
Thank you very much.

