WEBVTT

00:00.000 --> 00:11.000
All right, time to go.

00:11.000 --> 00:12.000
All right.

00:12.000 --> 00:15.000
Next up, Christian Brauner, talking to us about the VFS.

00:15.000 --> 00:17.000
Hello.

00:17.000 --> 00:18.000
Yeah.

00:18.000 --> 00:19.000
Yeah.

00:19.000 --> 00:21.000
Is the microphone on?

00:21.000 --> 00:22.000
I'm Christian.

00:22.000 --> 00:24.000
I'm one of the VFS maintainers.

00:25.000 --> 00:26.000
And, uh,

00:26.000 --> 00:27.000
I don't know.

00:27.000 --> 00:28.000
I don't know.

00:28.000 --> 00:29.000
I don't know.

00:29.000 --> 00:30.000
James?

00:30.000 --> 00:31.000
No.

00:37.000 --> 00:39.000
No more fips, please.

00:39.000 --> 00:42.000
And I'm one of the VFS maintainers.

00:42.000 --> 00:45.000
And I want to talk a bit about what we've done.

00:45.000 --> 00:53.000
It's just a few selected updates of what we have done in the last year.

00:54.000 --> 00:56.000
And this is really just a small portion.

00:56.000 --> 01:03.000
And just deals with, like, I guess, a subset of stuff that we have done that is UAPI visible.

01:03.000 --> 01:08.000
Um, there is obviously, as we just heard in a talk before, there's always a ton of changes.

01:08.000 --> 01:10.000
Internal changes going on to make stuff faster.

01:10.000 --> 01:12.000
Rework locking and so on.

01:12.000 --> 01:16.000
But that's just not what we're going to be concerned with here.

01:16.000 --> 01:18.000
So, I really have a bunch of topics.

01:18.000 --> 01:20.000
It's not possible to do it

01:20.000 --> 01:21.000
all in 30 minutes.

01:21.000 --> 01:28.000
So, if you have specific things that you are interested in, um, we can also skip over stuff and go, um,

01:28.000 --> 01:29.000
and talk about other things.

01:29.000 --> 01:34.000
And you should also feel free to ask questions if you have any while I'm talking.

01:34.000 --> 01:36.000
It's usually more interactive that way.

01:36.000 --> 01:40.000
By the way, one of the big things is obviously, uh, we have a new Mount API.

01:40.000 --> 01:44.000
It's new in the sense that it came out seven or eight years ago.

01:44.000 --> 01:46.000
And, uh, parts of user space have switched to it.

01:46.000 --> 01:49.000
Other parts haven't, uh, but it's seeing more and more adoption.

01:49.000 --> 01:52.000
And the really nice thing is that it's not path based.

01:52.000 --> 01:57.000
It's purely file descriptor based, which is kind of neat because paths aren't safe.

01:57.000 --> 02:01.000
Um, and, uh, this way you don't have to have anything visible in the file system.

02:01.000 --> 02:04.000
You can just interact with mounts via file descriptors.

02:04.000 --> 02:08.000
They don't need to be anywhere in your current mount name space.
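(Illustration, not from the talk: the fd-based flow described above, sketched in Python with ctypes. The syscall numbers and FSCONFIG constants are assumptions taken from the Linux UAPI headers for fsopen(2)/fsconfig(2)/fsmount(2); without CAP_SYS_ADMIN, or on a pre-5.2 kernel, the helper simply returns None.)

```python
import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)

# Syscall numbers are arch-unified since Linux 5.2 (assumed from UAPI headers).
SYS_fsopen, SYS_fsconfig, SYS_fsmount = 430, 431, 432
FSCONFIG_SET_STRING, FSCONFIG_CMD_CREATE = 1, 6

def mount_tmpfs_fd(size="1M"):
    """Mount a tmpfs purely via file descriptors: the resulting mount is
    attached nowhere in the mount namespace, it only exists as an fd.
    Returns the fd, or None if the kernel refuses (e.g. no privileges)."""
    fsfd = libc.syscall(SYS_fsopen, b"tmpfs", 0)
    if fsfd < 0:
        return None
    try:
        if libc.syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING,
                        b"size", size.encode(), 0) < 0:
            return None
        if libc.syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE,
                        None, None, 0) < 0:
            return None
        mntfd = libc.syscall(SYS_fsmount, fsfd, 0, 0)
        return mntfd if mntfd >= 0 else None
    finally:
        os.close(fsfd)

fd = mount_tmpfs_fd()
print("detached tmpfs fd:", fd)   # None when unprivileged
if fd is not None:
    os.close(fd)
```

Such an fd could then be attached somewhere with move_mount(2), or never attached at all, which is exactly the "nothing visible in the file system" property mentioned above.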

02:08.000 --> 02:11.000
Um, one thing that we did, that's a while ago.

02:11.000 --> 02:14.000
We introduced, uh, the concept of a 64-bit mount ID.

02:14.000 --> 02:18.000
So that you can uniquely identify, um, each mount.

02:18.000 --> 02:24.000
So before that, we had a 32-bit mount ID and on a busy system where you have lots and lots of containers.

02:24.000 --> 02:29.000
Uh, these mount IDs can quickly be recycled, not just because, like, eventually, you will wrap around.

02:29.000 --> 02:31.000
That's one possible concern.

02:31.000 --> 02:35.000
The other thing is that the way these mount IDs were allocated was like,

02:35.000 --> 02:38.000
if you unmounted a mount that had a specific mount ID and you created another mount,

02:38.000 --> 02:43.000
And it was very likely that it had the same mount ID as the previous mount.

02:43.000 --> 02:45.000
Because, like, the allocator quickly recycled IDs.

02:45.000 --> 02:51.000
For example, the PID number allocator is such that it wraps around, like, before it actually starts recycling.

02:51.000 --> 02:56.000
And this old mount ID allocator always, like, quickly recycled mount IDs.

02:56.000 --> 02:57.000
We changed that.

02:57.000 --> 03:01.000
We gave it a 64-bit mount ID to make it stable and more predictable.

03:01.000 --> 03:06.000
Um, and, uh, there is now an API that is built around this.

03:06.000 --> 03:10.000
Another thing is if you wanted to watch for mount notifications,

03:10.000 --> 03:12.000
you had to poll a specific file.

03:12.000 --> 03:13.000
You had to get to that as well.

03:13.000 --> 03:19.000
We have replaced that part of the API in recent years as well.

03:19.000 --> 03:21.000
So, okay, what have we done?

03:21.000 --> 03:24.000
Um, we added a new system call called statmount.

03:24.000 --> 03:28.000
And you can think of it as, like, a companion to statx in a way,

03:28.000 --> 03:31.000
only that it doesn't work on file descriptors.

03:31.000 --> 03:33.000
That's technically not true anymore.

03:33.000 --> 03:37.000
It works on 64-bit mount IDs, but for various reasons,

03:37.000 --> 03:41.000
we also made it possible now that you can use FDs with these system calls as well.

03:41.000 --> 03:45.000
Because in certain circumstances, you cannot look up a mount anymore.

03:45.000 --> 03:49.000
I won't go into details, but, like, an unmounted mount,

03:49.000 --> 03:51.000
you can't find it by its mount ID anymore,

03:51.000 --> 03:54.000
because it has been wiped from all mount name spaces.

03:54.000 --> 03:56.000
But in any case, 64-bit mount IDs.

03:56.000 --> 04:01.000
So, you call statmount on a mount ID, which you can retrieve via

04:01.000 --> 04:02.000
statx.

04:02.000 --> 04:04.000
That's where we expose the mount ID.

04:04.000 --> 04:09.000
You can call statmount and then you get a whole range of information

04:10.000 --> 04:12.000
about this mount.

04:12.000 --> 04:13.000
There's a ton of stuff in there.

04:13.000 --> 04:15.000
You get, like, the super block mount options.

04:15.000 --> 04:17.000
You get a mount option string, which is, like,

04:17.000 --> 04:19.000
all the file system specific mount options.

04:19.000 --> 04:22.000
You can retrieve, like, the generic mount flags.

04:22.000 --> 04:26.000
Obviously, you get the mount point.

04:26.000 --> 04:30.000
You get the root of the mount, which can be different.

04:30.000 --> 04:36.000
If you have a bind mount, for example, and a ton of other stuff as well.
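(Illustration, not from the talk: the statx-then-statmount flow above, sketched in Python with ctypes. The syscall number, mask values, and struct offsets are assumptions copied from the Linux UAPI headers as of roughly 6.8; on an older kernel or libc everything degrades to None.)

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)

AT_FDCWD = -100
STATX_MNT_ID_UNIQUE = 0x4000   # linux/stat.h, Linux 6.8+ (assumed)
SYS_statmount = 457            # arch-unified syscall number (assumed)
STATMOUNT_MNT_BASIC = 0x2      # linux/mount.h (assumed)

class MntIdReq(ctypes.Structure):
    # struct mnt_id_req, MNT_ID_REQ_SIZE_VER0 layout
    _fields_ = [("size", ctypes.c_uint32), ("spare", ctypes.c_uint32),
                ("mnt_id", ctypes.c_uint64), ("param", ctypes.c_uint64)]

def unique_mount_id(path):
    """64-bit mount ID of the mount containing path, via statx();
    None if the libc or kernel can't provide it."""
    buf = ctypes.create_string_buffer(256)   # struct statx is 256 bytes
    try:
        r = libc.statx(AT_FDCWD, path.encode(), 0, STATX_MNT_ID_UNIQUE, buf)
    except AttributeError:                   # libc without a statx() wrapper
        return None
    if r != 0:
        return None
    mask = int.from_bytes(buf.raw[0:4], "little")
    if not mask & STATX_MNT_ID_UNIQUE:       # kernel predates unique IDs
        return None
    return int.from_bytes(buf.raw[144:152], "little")   # stx_mnt_id

def statmount_mnt_id(path):
    """Ask statmount() about that mount and read the mnt_id it reports."""
    mnt_id = unique_mount_id(path)
    if mnt_id is None:
        return None
    req = MntIdReq(ctypes.sizeof(MntIdReq), 0, mnt_id, STATMOUNT_MNT_BASIC)
    out = ctypes.create_string_buffer(4096)
    if libc.syscall(SYS_statmount, ctypes.byref(req), out, 4096, 0) != 0:
        return None                          # no statmount() on this kernel
    # struct statmount: __u64 mnt_id sits at byte offset 40
    return int.from_bytes(out.raw[40:48], "little")

print(statmount_mnt_id("/"))   # the 64-bit mount ID, or None when unsupported
```

The strings (mount point, fs type, mount options) come back in the same buffer as offsets into a string table at the end of the struct; the fixed-size fields at the front, as read here, are stable because the struct is only ever extended at the end.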

04:37.000 --> 04:41.000
We also enhanced the mount namespaces themselves.

04:41.000 --> 04:45.000
They also have a 64-bit ID, which means statmount

04:45.000 --> 04:48.000
also works across mount namespaces.

04:48.000 --> 04:53.000
So, if you retrieve the ID of a specific mount namespace,

04:53.000 --> 04:58.000
say of a container, you don't need to be located in the mount namespace

04:58.000 --> 04:59.000
of that specific container.

04:59.000 --> 05:05.000
You can call statmount on a particular mount namespace from outside.

05:05.000 --> 05:07.000
Which is quite powerful.

05:07.000 --> 05:11.000
So, because, like, traditionally, interacting with a mount namespace

05:11.000 --> 05:15.000
always required you to be within that namespace

05:15.000 --> 05:17.000
to be able to see mount properties and so on.

05:17.000 --> 05:22.000
We don't require that anymore with the new mount API.

05:22.000 --> 05:24.000
The companion system call.

05:24.000 --> 05:28.000
So, here you can see, hopefully, somewhat readable.

05:28.000 --> 05:32.000
Probably not in the back, but the slides will be available.

05:33.000 --> 05:37.000
What kind of options you can retrieve via the statmount system call.

05:37.000 --> 05:38.000
It's really extensive.

05:38.000 --> 05:40.000
Like, everything that you could possibly want.

05:40.000 --> 05:45.000
I think it's nowadays even more than what you can see in the mount info file.

05:45.000 --> 05:48.000
We'll get to that in a second.

05:48.000 --> 05:51.000
The second system call that we added is listmount,

05:51.000 --> 05:53.000
because statmount is obviously useful.

05:53.000 --> 05:55.000
Like, you can stat a specific mount ID,

05:55.000 --> 06:01.000
but obviously you want the ability to iterate through a list of mounts efficiently.

06:01.000 --> 06:05.000
If you implement something like that, you have different ways of doing this, right?

06:05.000 --> 06:09.000
So, you could, for example, envision a way of implementing something like list mount.

06:09.000 --> 06:17.000
You take a mount ID, you call a system call that gives you back the next ID,

06:17.000 --> 06:21.000
if it's sequential somehow, and then you keep iterating.

06:21.000 --> 06:23.000
The problem is that doesn't scale, right?

06:23.000 --> 06:25.000
Because the kernel needs to do a lot of work.

06:25.000 --> 06:28.000
It has to allocate a file struct in the kernel.

06:28.000 --> 06:32.000
It has to install that in the file descriptor table.

06:32.000 --> 06:35.000
It has to return it to user space.

06:35.000 --> 06:37.000
Then you can call statmount on it.

06:37.000 --> 06:39.000
Then you close that file descriptor.

06:39.000 --> 06:41.000
Then you get the next one and the next one.

06:41.000 --> 06:44.000
You can make this more efficient by returning an array of file descriptors.

06:44.000 --> 06:46.000
That's kind of wacky in my opinion.

06:46.000 --> 06:49.000
You could do the getdents model.

06:49.000 --> 06:53.000
But the getdents model needs to keep kernel-internal state.

06:53.000 --> 06:57.000
And you retrieve names, which for mounts also doesn't make a lot of sense.

06:57.000 --> 07:01.000
You could probably come up with an API like that, but I don't like the fact that you have,

07:01.000 --> 07:05.000
like, need to have a cursor in the kernel to remember where you actually were.

07:05.000 --> 07:09.000
With 64-bit mount IDs it's pretty trivial, pretty trivial.

07:09.000 --> 07:13.000
You just give it, like, the 64-bit mount ID of the mount that you want,

07:13.000 --> 07:16.000
and then it gives you the children of that mount.

07:16.000 --> 07:21.000
Specifically, the next mount ID, because kernel-internally,

07:21.000 --> 07:24.000
it's kept in an RB tree, a red-black tree.

07:24.000 --> 07:27.000
Anyway, the gist is, via this mechanism,

07:27.000 --> 07:31.000
You can very quickly iterate through a list of mounts in a mount namespace.
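(Illustration, not from the talk: listmount() as a ctypes sketch. The syscall number, the LSMT_ROOT sentinel, and the struct layout are assumptions from the Linux UAPI headers; kernels before 6.8 make the helper return None.)

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)

SYS_listmount = 458                  # arch-unified syscall number (assumed)
LSMT_ROOT = 0xFFFFFFFFFFFFFFFF       # root mount of the mount namespace

class MntIdReq(ctypes.Structure):
    # struct mnt_id_req, MNT_ID_REQ_SIZE_VER0 layout
    _fields_ = [("size", ctypes.c_uint32), ("spare", ctypes.c_uint32),
                ("mnt_id", ctypes.c_uint64), ("param", ctypes.c_uint64)]

def list_mounts(parent_id=LSMT_ROOT, max_ids=512):
    """64-bit IDs of the mounts under parent_id in the current mount
    namespace. param carries the last-seen ID so iteration can continue
    where it left off (0 here: start from the beginning). Returns None
    on kernels without listmount()."""
    req = MntIdReq(ctypes.sizeof(MntIdReq), 0, parent_id, 0)
    ids = (ctypes.c_uint64 * max_ids)()
    n = libc.syscall(SYS_listmount, ctypes.byref(req), ids, max_ids, 0)
    if n < 0:
        return None
    return list(ids[:n])

print(list_mounts())   # a list of 64-bit mount IDs, or None
```

Each returned ID can then be fed straight into statmount(), and because the kernel keeps the mounts sorted in that red-black tree, continuation via the last-seen ID is cheap: no cursor state is kept in the kernel.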

07:31.000 --> 07:34.000
By default, in the mount namespace that you're currently located in.

07:34.000 --> 07:37.000
So if you're in the initial mount namespace, you log in to your computer,

07:37.000 --> 07:42.000
you call listmount, and you can iterate through the mount table of the system.

07:42.000 --> 07:48.000
But, same as with statmount, we made it possible so that you can take the ID of a mount namespace,

07:48.000 --> 07:53.000
and then you can say, give me the mounts in that mount namespace of that container,

07:53.000 --> 07:57.000
and then you can iterate from outside through all of the mounts inside of a container.

07:57.000 --> 08:02.000
The really nice thing about this is, because you can also iterate through all mount namespaces,

08:02.000 --> 08:04.000
that's a separate API.

08:04.000 --> 08:07.000
We're not going to be able to cover here, but essentially,

08:07.000 --> 08:14.000
via these two mechanisms, it's possible that you can iterate through all mounts on the system,

08:14.000 --> 08:16.000
like literally all mounts.

08:16.000 --> 08:19.000
I mean, if you wanted to do this right now, what you have to do is like,

08:19.000 --> 08:22.000
iterate through procfs, I guess.

08:22.000 --> 08:25.000
Look at all of the processes that are running on the system.

08:25.000 --> 08:32.000
Look at the mountinfo file, then figure out whether it is in a separate mount namespace,

08:32.000 --> 08:35.000
and then you can start listing all of them. Super inefficient.

08:35.000 --> 08:39.000
With this, you can just iterate through all of the mounts on the whole system.

08:39.000 --> 08:41.000
You can literally dump everything.

08:41.000 --> 08:45.000
It wasn't efficiently possible before.

08:46.000 --> 08:48.000
I've linked to a program that does this.

08:48.000 --> 08:51.000
Yeah, I think it's in the kernel tree.

08:51.000 --> 08:55.000
You can look at this, it's not pretty, let's say.

08:55.000 --> 08:57.000
I'm very rigid with my kernel code.

08:57.000 --> 09:00.000
I'm sloppy with my testing code.

09:00.000 --> 09:04.000
It's not pretty, but it gets the job done.

09:04.000 --> 09:08.000
You can look at it, and then you can implement this for yourself, for example.

09:08.000 --> 09:13.000
But I think, eventually, the mount binary that you use to mount stuff on your system,

09:13.000 --> 09:17.000
which nowadays uses the new mount API and is very quick at adopting new features,

09:17.000 --> 09:21.000
will eventually get the same thing, so that you can just use that binary to list all

09:21.000 --> 09:23.000
mounts of the system.

09:23.000 --> 09:25.000
But why did we do this?

09:25.000 --> 09:29.000
Like, okay, the one motivation for this was obviously you wanted to make it easy

09:29.000 --> 09:32.000
to retrieve information about mounts, without relying on mountinfo,

09:32.000 --> 09:36.000
but you could make the argument why the hell did you invent like this whole new API?

09:36.000 --> 09:39.000
Why didn't you just use the mountinfo file?

09:39.000 --> 09:42.000
You could have extended it if you wanted to display more information in it.

09:42.000 --> 09:45.000
Well, okay.

09:45.000 --> 09:47.000
The systemd people are here.

09:47.000 --> 09:50.000
They can probably tell you a thing or two about the mount info file,

09:50.000 --> 09:52.000
but it actually just doesn't scale.

09:52.000 --> 09:54.000
Like, okay, text parsing in user space, you can do that.

09:54.000 --> 09:56.000
That's probably fine.

09:56.000 --> 09:58.000
But for example, you cannot have diffs.

09:58.000 --> 10:03.000
Let's say you have a mount or umount event, then the mountinfo file will generate a

10:03.000 --> 10:06.000
notification that user space can listen on, but it's not specific.

10:06.000 --> 10:10.000
It just informs you that something has changed, which means user space is responsible

10:10.000 --> 10:14.000
for parsing that whole file and figuring out what exactly changed

10:14.000 --> 10:18.000
with regards to the previous version.
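(Illustration, not from the talk: the userspace burden just described, as a minimal sketch. On every "something changed" wakeup, userspace must re-read the whole mountinfo file and diff it against its previous snapshot. The snapshot strings below are made-up examples in mountinfo's field layout.)

```python
def parse_mountinfo(text):
    """Map mount ID -> mount point for one /proc/self/mountinfo snapshot.
    Fields: ID parent-ID major:minor root mount-point options ..."""
    mounts = {}
    for line in text.strip().splitlines():
        f = line.split()
        mounts[int(f[0])] = f[4]
    return mounts

def diff_mounts(before, after):
    """What userspace has to compute on every poll notification: compare
    full snapshots to find what was mounted or unmounted."""
    old, new = parse_mountinfo(before), parse_mountinfo(after)
    added = {i: new[i] for i in new.keys() - old.keys()}
    removed = {i: old[i] for i in old.keys() - new.keys()}
    return added, removed

before = """\
21 1 0:19 / / rw shared:1 - ext4 /dev/sda1 rw
36 21 0:31 / /tmp rw shared:5 - tmpfs tmpfs rw
"""
after = """\
21 1 0:19 / / rw shared:1 - ext4 /dev/sda1 rw
40 21 0:33 / /mnt/data rw shared:7 - xfs /dev/sdb1 rw
"""
added, removed = diff_mounts(before, after)
print(added)    # {40: '/mnt/data'}
print(removed)  # {36: '/tmp'}
```

With two mounts this is trivial; with thousands of mounts churning, re-parsing and diffing the entire file on every event is exactly the cost the new notification API avoids.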

10:18.000 --> 10:23.000
And nowadays people sometimes run tens of thousands of containers on a system.

10:23.000 --> 10:28.000
And some container runtimes, or some container

10:28.000 --> 10:32.000
orchestrators, I guess, do stuff like, for example,

10:32.000 --> 10:35.000
where they put all of the mounts in the initial mount namespace,

10:35.000 --> 10:37.000
meaning they all show up in the host mount namespace,

10:37.000 --> 10:41.000
which means systemd on such a system constantly gets

10:41.000 --> 10:42.000
spammed with mount events.

10:42.000 --> 10:45.000
Events it's technically not very interested in.

10:45.000 --> 10:48.000
But the mount table constantly sees updates, updates,

10:48.000 --> 10:52.000
updates, and systemd, or whatever mount tool you use, is just busy catching up.

10:52.000 --> 10:54.000
So, that was something that we wanted to get away from.

10:54.000 --> 10:59.000
We wanted to have a mechanism where you can catch up in case you missed events.

10:59.000 --> 11:02.000
You get an indication when you actually missed events,

11:02.000 --> 11:05.000
and it should be way more efficient.

11:05.000 --> 11:10.000
Than what you currently have, and not be based on the mountinfo file.

11:10.000 --> 11:15.000
So, what we used to do or what we have implemented nowadays

11:15.000 --> 11:19.000
is an extension to fanotify.

11:19.000 --> 11:22.000
Because fanotify is basically a notification mechanism.

11:22.000 --> 11:25.000
It's nowadays also available unprivileged.

11:25.000 --> 11:30.000
It has all of the really nice handling of dealing with overflows of the queue.

11:30.000 --> 11:34.000
It has like a reliable reporting mechanism of events

11:34.000 --> 11:35.000
and so on.

11:35.000 --> 11:39.000
And so we didn't want to reinvent everything.

11:39.000 --> 11:43.000
I think it was Amir's idea back in the day that we use fanotify for this.

11:43.000 --> 11:46.000
And so nowadays, what you can do with fanotify,

11:46.000 --> 11:49.000
you can take a mount namespace file descriptor.

11:49.000 --> 11:51.000
It's easy to get, as it's exposed in proc.

11:51.000 --> 11:53.000
You can get it via a pidfd nowadays.

11:53.000 --> 11:55.000
We get to that in a second as well.

11:55.000 --> 11:58.000
Depending on how much time we have.

11:59.000 --> 12:03.000
And then you can use that mount namespace file descriptor.

12:03.000 --> 12:08.000
Register it with fanotify, and then you can, via the fanotify API,

12:08.000 --> 12:10.000
Listen for events.

12:10.000 --> 12:12.000
And every time a mount changes,

12:12.000 --> 12:16.000
you get an event for example for a specific mount.

12:16.000 --> 12:19.000
Sorry for a specific mount namespace ID.

12:19.000 --> 12:22.000
Then you can use that ID to get information about that mount.

12:22.000 --> 12:24.000
What exactly changed for example.

12:25.000 --> 12:30.000
And then you can also use the listmount system call to figure out if any of the child

12:30.000 --> 12:32.000
mounts have changed.

12:32.000 --> 12:35.000
You get an indication if you missed any events.

12:35.000 --> 12:39.000
For example, if the queue overflowed because there were too many mount events.

12:39.000 --> 12:41.000
In that case, you just need to restart.

12:41.000 --> 12:44.000
And, like, parse the whole mount table of that mount namespace.

12:44.000 --> 12:45.000
But this is way more efficient.
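(Illustration, not the real fanotify API: a toy model of the catch-up semantics just described. Incremental events are applied when the queue kept up; an explicit overflow indication triggers one full re-enumeration, which is what listmount makes cheap. All names here are made up for the sketch.)

```python
from collections import deque

class MountEventQueue:
    """Toy bounded kernel queue: when it overflows, the consumer gets an
    explicit overflow marker instead of silent event loss."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()
        self.overflowed = False
    def push(self, ev):
        if len(self.q) >= self.capacity:
            self.overflowed = True     # drop, but remember that we dropped
        else:
            self.q.append(ev)
    def drain(self):
        evs, over = list(self.q), self.overflowed
        self.q.clear()
        self.overflowed = False
        return evs, over

def consume(queue, table, list_all_mounts):
    """Apply incremental events; on overflow, resynchronize by listing
    the whole mount table again."""
    events, overflowed = queue.drain()
    if overflowed:
        return dict(list_all_mounts())   # one full resync, then continue
    for op, mnt_id, where in events:
        if op == "mount":
            table[mnt_id] = where
        else:
            table.pop(mnt_id, None)
    return table

q = MountEventQueue(capacity=2)
table = {1: "/"}
q.push(("mount", 2, "/proc"))
q.push(("umount", 2, None))
q.push(("mount", 3, "/sys"))            # third event overflows the queue
table = consume(q, table, lambda: {1: "/", 3: "/sys"}.items())
print(table)   # {1: '/', 3: '/sys'} after the full resync
```

The key property is that missing events is detectable and recoverable, instead of every wakeup forcing the full re-parse that mountinfo requires.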

12:45.000 --> 12:49.000
I think in the initial testing that we saw, it's like orders of magnitude

12:49.000 --> 12:53.000
faster and more reliable than the mountinfo file.

12:53.000 --> 12:57.000
So I'm not quite sure if systemd has already switched to that,

12:57.000 --> 12:59.000
but that's going to be the future way of doing it.

12:59.000 --> 13:01.000
You're probably waiting on libmount, right?

13:01.000 --> 13:03.000
Libmount is ready.

13:03.000 --> 13:06.000
We're just being a bit lazy.

13:06.000 --> 13:13.000
Anyway, that should get rid of a bunch of bottlenecks in systemd,

13:13.000 --> 13:14.000
hopefully.

13:14.000 --> 13:18.000
But obviously you can also make use of this for example.

13:18.000 --> 13:22.000
And if you have a container manager that manages a single container,

13:22.000 --> 13:25.000
that's usually how it's implemented with runc,

13:25.000 --> 13:26.000
but also with other container runtimes.

13:26.000 --> 13:29.000
Then that could register the mount namespace as well.

13:29.000 --> 13:32.000
And register for mount events inside of that mount namespace as well.

13:32.000 --> 13:34.000
If you wanted to get really, really crazy,

13:34.000 --> 13:38.000
one of the original ideas that we once had was that you could,

13:38.000 --> 13:41.000
like, use a mechanism like this to supervise mount events,

13:41.000 --> 13:45.000
but that's probably not something that we're going to need in the future.

13:45.000 --> 13:49.000
But anyway, mount notifications and the new mount API.

13:49.000 --> 13:54.000
So systemd and util-linux are probably going to be the first users

13:54.000 --> 13:56.000
of this. Libmount is ready.

13:56.000 --> 13:57.000
It already uses this.

13:57.000 --> 13:59.000
systemd is being lazy.

13:59.000 --> 14:04.000
But hopefully we will get them very soon.

14:04.000 --> 14:06.000
So in the interest of time,

14:06.000 --> 14:09.000
because I talked about mounts for quite a bit.

14:09.000 --> 14:13.000
So I have a bunch of mount related topics that could be interesting.

14:13.000 --> 14:16.000
I have a bunch of pidfd-related topics, so pidfds.

14:17.000 --> 14:20.000
And API did we build as well.

14:20.000 --> 14:26.000
And I have some very new recent features that we've been developing just for this cycle.

14:26.000 --> 14:32.000
So is there any preference on what we want to talk about, pidfds or the mount stuff?

14:32.000 --> 14:33.000
Anywhere stuff?

14:33.000 --> 14:35.000
Anywhere stuff?

14:35.000 --> 14:36.000
Okay.

14:36.000 --> 14:42.000
So we'll skip over a bunch of stuff.

14:42.000 --> 14:43.000
Okay.

14:43.000 --> 14:46.000
So this is very recent.

14:46.000 --> 14:48.000
When you create mount name spaces in a container,

14:48.000 --> 14:50.000
I think this is going to be pretty interesting.

14:50.000 --> 14:52.000
I actually could do a demo for this.

14:52.000 --> 14:58.000
But one of the problems is nowadays when you spawn a container

14:58.000 --> 15:01.000
and you create a new mount name space,

15:01.000 --> 15:04.000
it's probably the first thing that you need to do.

15:04.000 --> 15:06.000
Like if you don't use any other namespaces at all,

15:06.000 --> 15:10.000
you're probably going to use the mount namespace, simply because, like, you don't want all of those

15:11.000 --> 15:14.000
mounts to leak onto the host, right?

15:14.000 --> 15:16.000
It should live in its separate world.

15:16.000 --> 15:19.000
The problem with this is, every time you do this, the system call is called

15:19.000 --> 15:20.000
unshare, with CLONE_NEWNS.

15:20.000 --> 15:22.000
It's called CLONE_NEWNS.

15:22.000 --> 15:26.000
And there is a binary that you probably use in a shell.

15:26.000 --> 15:29.000
unshare --mount, for example.

15:29.000 --> 15:32.000
The problem with this is it clones the whole mount table.

15:32.000 --> 15:33.000
Like everything that you have.

15:33.000 --> 15:35.000
If on your host system you have 30 mounts,

15:35.000 --> 15:37.000
probably not that big of a deal.

15:37.000 --> 15:40.000
If you have 1,000 or 2,000 or 3,000 mounts,

15:40.000 --> 15:44.000
then the kernel copies 3,000 mounts into a new mount name space.

15:44.000 --> 15:46.000
And then when you spawn a container,

15:46.000 --> 15:49.000
you usually don't even want these 3,000 mounts from the host, right?

15:49.000 --> 15:52.000
The point is that you isolate the container from the host.

15:52.000 --> 15:55.000
You don't want it to have access to all of the crap the host has available.

15:55.000 --> 15:58.000
So what you usually do is you set up a new rootfs

15:58.000 --> 16:00.000
for the container somehow,

16:00.000 --> 16:02.000
in some per-container directory, usually.

16:02.000 --> 16:06.000
And then you switch out the current rootfs in that namespace

16:06.000 --> 16:07.000
for that new rootfs

16:07.000 --> 16:08.000
the container is supposed to use.

16:08.000 --> 16:10.000
containerd does this.

16:10.000 --> 16:12.000
And runc does this.

16:12.000 --> 16:14.000
Incus does this.

16:14.000 --> 16:16.000
Kubernetes does this, like everything.

16:16.000 --> 16:17.000
Everything does this. The thing is,

16:17.000 --> 16:21.000
It gets rid of all of the mounts that you copied before.

16:21.000 --> 16:26.000
So you copied 2000 mounts for completely no reason at all.

16:26.000 --> 16:28.000
The problem with this is not just that

16:28.000 --> 16:29.000
this is not performant.

16:29.000 --> 16:32.000
The problem is that the way the kernel is constructed currently.

16:32.000 --> 16:36.000
Because, like, we have relationships between different mount namespaces,

16:36.000 --> 16:38.000
they can propagate mounts to each other,

16:38.000 --> 16:41.000
so that you can share stuff. There is one global,

16:41.000 --> 16:44.000
one specific lock that we can't easily get rid of.

16:44.000 --> 16:47.000
So that means you keep blocking on that lock,

16:47.000 --> 16:51.000
waiting for someone to unshare a mount namespace with, like, thousands of mounts

16:51.000 --> 16:54.000
that you're never going to use. Completely useless, right?

16:54.000 --> 16:57.000
So we thought, okay, here's an idea how to get this,

16:57.000 --> 16:59.000
how to get this done,

16:59.000 --> 17:01.000
so we don't have to do this anymore.

17:01.000 --> 17:03.000
And this is the new API for this.

17:03.000 --> 17:06.000
It's, like, open_tree namespace and fsmount namespace.

17:06.000 --> 17:09.000
That works with two new system calls we have in the kernel.

17:09.000 --> 17:13.000
And what they effectively do is you can specify a path.

17:13.000 --> 17:16.000
You can specify a path or a file descriptor.

17:16.000 --> 17:18.000
I should say.

17:18.000 --> 17:21.000
And when you specify that for open_tree,

17:21.000 --> 17:24.000
the namespace flag, if you specify that file descriptor with that flag,

17:24.000 --> 17:25.000
you pass that flag,

17:25.000 --> 17:29.000
then the kernel will create a copy of that specific mount or mount tree,

17:29.000 --> 17:33.000
and then create a new mount namespace and mount it.

17:33.000 --> 17:36.000
Place it as the root of that mount namespace,

17:36.000 --> 17:39.000
which means you just copy exactly the mount or the number of mounts

17:39.000 --> 17:42.000
that you need for the container.

17:42.000 --> 17:44.000
And fsmount is a similar concept,

17:47.000 --> 17:48.000
only that fsmount deals with actually mounting a file system.

17:47.000 --> 17:48.000
So open_tree clones.

17:48.000 --> 17:49.000
It's like bind mounting.

17:49.000 --> 17:51.000
It clones an existing mount tree.
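(Illustration, not from the talk: the detached-clone building block that open_tree provides, sketched with ctypes. The syscall number and OPEN_TREE_CLONE/AT_RECURSIVE values are assumptions from the Linux 5.2+ UAPI headers; unprivileged callers get EPERM and the helper returns None. The namespace-creating flag discussed in the talk is newer and not shown here.)

```python
import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)

SYS_open_tree = 428       # arch-unified syscall number, Linux 5.2+ (assumed)
OPEN_TREE_CLONE = 1       # copy the tree instead of just reopening it
AT_RECURSIVE = 0x8000     # include all submounts
AT_FDCWD = -100

def clone_mount_tree(path):
    """Create a detached copy of the mount tree at path: effectively a
    recursive bind mount that exists only as an fd and is attached
    nowhere in the current mount namespace. Returns the fd, or None
    (unprivileged callers are refused)."""
    fd = libc.syscall(SYS_open_tree, AT_FDCWD, path.encode(),
                      OPEN_TREE_CLONE | AT_RECURSIVE)
    return fd if fd >= 0 else None

fd = clone_mount_tree("/")
print("detached tree fd:", fd)   # None when unprivileged
if fd is not None:
    os.close(fd)
```

With the plain flags shown here, the clone would still be attached somewhere with move_mount(2); the point of the new namespace variant described above is that the detached copy instead becomes the root of a fresh mount namespace directly, so only the mounts you actually need are ever copied.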

17:51.000 --> 17:54.000
And fsmount, and the API associated with it.

17:54.000 --> 17:56.000
It creates a new file system mount.

17:56.000 --> 18:00.000
So the stuff that you do when you say, I want to mount a new XFS file system.

18:00.000 --> 18:02.000
Or a new EROFS file system or whatever.

18:02.000 --> 18:04.000
And it's the same concept.

18:04.000 --> 18:05.000
Like you mount that.

18:05.000 --> 18:08.000
And then it becomes the root of a new mount name space.

18:08.000 --> 18:12.000
I can try and show you what I mean.

18:12.000 --> 18:14.000
An unprepared live demo.

18:14.000 --> 18:17.000
A semi-prepared live demo, then, also.

18:17.000 --> 18:19.000
Let me look.

18:19.000 --> 18:24.000
There it is.

18:24.000 --> 18:27.000
Okay, you can somewhat see this.

18:27.000 --> 18:31.000
I'll give you a second.

18:31.000 --> 18:35.000
I'm sorry.

18:35.000 --> 18:40.000
Okay.

18:40.000 --> 18:51.000
Okay.

18:51.000 --> 18:54.000
Okay.

18:54.000 --> 18:56.000
This is just a VM.

18:56.000 --> 18:59.000
Actually, I already created the demo folder.

18:59.000 --> 19:09.000
And here I have a static busybox binary.

19:09.000 --> 19:14.000
Not a dynamic executable.

19:14.000 --> 19:17.000
So if you do a traditional way, right?

19:17.000 --> 19:18.000
Unsharing a mount namespace.

19:18.000 --> 19:22.000
You do sudo unshare --mount.

19:22.000 --> 19:27.000
You can see, if you do cat /proc/self/mountinfo,

19:27.000 --> 19:32.000
All the stuff that you had on the host has been copied into that mount name space.

19:32.000 --> 19:35.000
And now you can switch out a new rootfs in here.

19:35.000 --> 19:37.000
And then you could technically start your container.

19:37.000 --> 19:42.000
So I have a program here.

19:42.000 --> 19:47.000
It is also not very pretty.

19:47.000 --> 19:49.000
I'm going to be sure if I have a winner on here.

19:49.000 --> 19:54.000
But what it does is exactly what I talked about, like open_tree namespace.

19:54.000 --> 19:55.000
Here you can see it.

19:55.000 --> 19:56.000
It's like a new flag.

19:56.000 --> 19:58.000
It defines the system call.

19:58.000 --> 20:00.000
You can give it a path.

20:00.000 --> 20:03.000
It calls open_tree, which creates a detached mount.

20:03.000 --> 20:05.000
So it creates effectively a bind mount.

20:05.000 --> 20:08.000
Not visible anywhere in your current mount name space.

20:08.000 --> 20:11.000
Then it returns a file descriptor to a mount name space.

20:11.000 --> 20:17.000
You setns into that mount namespace, and then it tries to execute the busybox binary.

20:17.000 --> 20:19.000
That's all that it does.

20:19.000 --> 20:21.000
So we can create a directory somewhere.

20:21.000 --> 20:24.000
Let's say here, for the sake of this like a demo.

20:24.000 --> 20:27.000
And then I take the busybox binary

20:27.000 --> 20:30.000
and move that into demo.

20:31.000 --> 20:37.000
And then I do ignoring a user nest from now demo.

20:39.000 --> 20:41.000
Okay, there's nothing in there.

20:41.000 --> 20:42.000
Other than the

20:42.000 --> 20:43.000
busybox binary.

20:43.000 --> 20:45.000
I could do mkdir proc.

20:45.000 --> 20:46.000
Oh god.

20:46.000 --> 20:48.000
A busybox mount, right?

20:48.000 --> 20:49.000
A proc.

20:49.000 --> 20:51.000
Proc.

20:51.000 --> 20:52.000
Sush.

20:52.000 --> 20:53.000
Proc.

20:53.000 --> 20:54.000
Okay.

20:54.000 --> 20:56.000
And then I can do cat proc...

20:57.000 --> 20:58.000
mountinfo.

20:58.000 --> 21:00.000
Ah.

21:00.000 --> 21:03.000
That's all you have in there.

21:03.000 --> 21:05.000
Thank you.

21:09.000 --> 21:15.000
I'm surprised this worked.

21:15.000 --> 21:18.000
But okay, I'll take it.

21:21.000 --> 21:24.000
And we have a few more minutes, actually.

21:24.000 --> 21:31.000
So yeah, once you've switched your container runtime over to this, there are some limits to this

21:31.000 --> 21:33.000
that we don't need to get at,

21:33.000 --> 21:34.000
don't have time to get into.

21:34.000 --> 21:41.000
But if you switch your container runtime over to this, you will probably get a lot of performance benefits out of it in terms of scaling.

21:41.000 --> 21:44.000
Because you're not wasting cycles anymore for no good reason.

21:44.000 --> 21:45.000
Yes.

21:47.000 --> 21:52.000
So to make use of this in your actual container runtime, you're going to have to switch your entire

21:53.000 --> 21:54.000
mount setup.

21:54.000 --> 21:55.000
Yes.

21:55.000 --> 21:56.000
To FDs, right?

21:56.000 --> 21:57.000
Yes.

21:57.000 --> 21:59.000
Because otherwise you have to set up the mount namespace before.

21:59.000 --> 22:03.000
If you want the goodies, if you want the goodies, then you need to do some work.

22:03.000 --> 22:09.000
So the thing is, that's what I said before, the new mount API is completely FD based.

22:09.000 --> 22:12.000
And traditionally, like, you had a single system call, mount, right?

22:12.000 --> 22:13.000
And mount took paths.

22:13.000 --> 22:15.000
You could technically use file descriptors.

22:15.000 --> 22:20.000
If you used the trick where you did /proc/self/fd and then passed the FD this way to mount.

22:20.000 --> 22:25.000
But the new mount API only gives you the option to work with file descriptors.

22:25.000 --> 22:27.000
I've got maybe a dumb question.

22:27.000 --> 22:31.000
But why wasn't that added to clone3?

22:31.000 --> 22:36.000
So if anything, I would like to clone3, by the way, this is the mount,

22:36.000 --> 22:39.000
the mount table I would like to have as part of the clone.

22:39.000 --> 22:40.000
Yeah.

22:40.000 --> 22:42.000
You could do that, I guess, API-wise.

22:42.000 --> 22:45.000
I find it nicer in the actual FS APIs.

22:45.000 --> 22:47.000
But you could actually do it twice:

22:47.000 --> 22:50.000
you could in the first step set up the mount namespace,

22:50.000 --> 22:53.000
and then still do your clone for all of the other ones,

22:53.000 --> 22:55.000
or do all of the namespaces separately.

22:55.000 --> 22:57.000
You could extend clone 3 with that as well.

22:57.000 --> 22:59.000
Like, where it makes sense.

22:59.000 --> 23:02.000
But most of the setups, most of the setups that I've seen.

23:02.000 --> 23:07.000
Even sometimes in LXC, we get away with unsharing, with cloning it right away.

23:07.000 --> 23:11.000
But like, for example, most stuff that I've seen.

23:11.000 --> 23:15.000
Obviously I have a systemd-centric eye in a sense, oftentimes.

23:16.000 --> 23:20.000
You just unshare. You just always unshare because it gives you more control.

23:20.000 --> 23:24.000
Over what is owned by what, and over the ordering.

23:24.000 --> 23:28.000
But there's nothing that would necessarily prevent us from doing something like that.

23:28.000 --> 23:30.000
clone3 is extensible.

23:30.000 --> 23:31.000
So sure.

23:31.000 --> 23:34.000
Because I could see that being a good security measure.

23:34.000 --> 23:35.000
For some people, too.

23:35.000 --> 23:37.000
Particularly, you know, via clone3.

23:37.000 --> 23:38.000
And by the way, I don't need you to fix this for me.

23:38.000 --> 23:40.000
So don't bother.

23:41.000 --> 23:45.000
So one of the last things that I'm going to cover.

23:45.000 --> 23:48.000
Very quickly, or not.

23:53.000 --> 23:57.000
— is... I want to say, the nice thing about this versus clone3 is that it would allow you —

23:57.000 --> 23:58.000
We would need to do more extensions for it.

23:58.000 --> 23:59.000
We discussed it before.

23:59.000 --> 24:00.000
I can't hear you.

24:00.000 --> 24:02.000
We would need to do more extensions for this.

24:02.000 --> 24:05.000
But the nice thing about doing it with the mount APIs is that you can —

24:05.000 --> 24:09.000
If you work in a separate mount namespace, you can construct the entire mount tree ahead of time.

24:09.000 --> 24:11.000
And then just switch into it.

24:11.000 --> 24:12.000
And with the clone 3 thing.

24:12.000 --> 24:15.000
You're stuck in the situation where you don't have the mounts yet.

24:15.000 --> 24:18.000
Which means you have to work around things like not being able to mount proc.

24:18.000 --> 24:19.000
Because no proc is visible.

24:19.000 --> 24:21.000
And you have to deal with that crap, which is like —

24:21.000 --> 24:24.000
I'm sure it would make sense to do it in cases where you know you can figure that out.

24:24.000 --> 24:25.000
But for.

24:25.000 --> 24:27.000
I'd like this API.

24:27.000 --> 24:28.000
I would buy it.

24:28.000 --> 24:30.000
I'd prefer this one.

24:30.000 --> 24:33.000
So last piece.

24:33.000 --> 24:35.000
Is nullfs.

24:35.000 --> 24:38.000
So when you boot on a.

24:38.000 --> 24:44.000
When you boot a regular kernel, the first thing the kernel has is, like, the real rootfs.

24:44.000 --> 24:47.000
Or whatever you'd like to call it; the terminology is confusing.

24:47.000 --> 24:49.000
The real rootfs.

24:49.000 --> 24:54.000
I mean the thing that the rootfs you see when you do ls is mounted on.

24:54.000 --> 24:56.000
That serves as the parent mount.

24:56.000 --> 24:59.000
The kernel uses this in two ways though.

24:59.000 --> 25:00.000
It's first of all.

25:00.000 --> 25:03.000
It uses it to handle the initramfs case.

25:04.000 --> 25:09.000
So when you specify an initramfs, the kernel creates —

25:09.000 --> 25:10.000
— a mount —

25:10.000 --> 25:13.000
— a real rootfs that is a tmpfs or a ramfs.

25:13.000 --> 25:17.000
But that mount itself doesn't have a parent, meaning you cannot unmount it.

25:17.000 --> 25:21.000
You unpack the initramfs into that tmpfs or ramfs.

25:21.000 --> 25:23.000
You then have a bunch of crap in there.

25:23.000 --> 25:26.000
And so now you have two options.

25:26.000 --> 25:27.000
You mount the rootfs.

25:27.000 --> 25:30.000
You pivot_root into the rootfs and you leave everything

25:30.000 --> 25:33.000
that is in the initramfs under there.

25:33.000 --> 25:36.000
That's performance wise, kind of nice.

25:36.000 --> 25:45.000
The problem is, anyone who somehow manages to get their hands on the real rootfs will see whatever you unpacked in there.

25:45.000 --> 25:49.000
So if you have any specific sensitive binary or something, then that will remain there.

25:49.000 --> 25:58.000
So usually what init system implementations do is an rm -rf to empty everything out of the initramfs, then mount the real rootfs, and then fire away.

25:59.000 --> 26:02.000
Obviously that's just wasted effort, right?

26:02.000 --> 26:08.000
It also means you can't pivot_root, because, like... so pivot_root is the system call which —

26:08.000 --> 26:10.000
You're also wasting memory.

26:10.000 --> 26:13.000
But you're also wasting memory by doing that.

26:13.000 --> 26:15.000
Yeah, you're wasting memory.

26:15.000 --> 26:22.000
But anyway, the pivot_root system call switches the rootfs out underneath you, right?

26:22.000 --> 26:28.000
And you can't do this here, because it requires that the new root is already a mount point.

26:28.000 --> 26:37.000
Even if you bind-mount the initramfs onto itself, it still wouldn't help you, because you'd still leave everything in that directory.

26:37.000 --> 26:48.000
So nullfs essentially is a concept that says: do away with all this; all we really want is something that later serves as a mount point for the real rootfs.

26:48.000 --> 26:54.000
And nullfs is, like, completely catatonic, catatonic in the sense that you can't create anything in there.

26:54.000 --> 27:01.000
You can't create anything in there, you can't change mount options, you can't do anything meaningful with it at all.

27:01.000 --> 27:12.000
Its only purpose is to serve as a mount point, which also means you can now mount a tmpfs or a ramfs on there, and you unpack the bunch of crap that you need in there.

27:12.000 --> 27:15.000
And then pivot_root suddenly works.

27:15.000 --> 27:18.000
You can just switch it out; you don't need to do the rm -rf anymore.

27:18.000 --> 27:24.000
The kernel will just do away with everything that is in that tmpfs when you unmount it.

27:24.000 --> 27:29.000
You have your real rootfs on there, and then you just boot.

27:29.000 --> 27:33.000
It also works if you have an initrd, but initrd systems are fairly rare nowadays.

27:33.000 --> 27:44.000
The other nice thing is that nowadays we have this concept in mount namespaces where we lock the rootfs mount on top of the real rootfs, simply because of the thing that I said before.

27:44.000 --> 27:51.000
There could be stuff in there because it's a tmpfs, or people could create stuff in there if they got access to it.

27:51.000 --> 27:55.000
So it's locked, you can't unmount it, and it makes a couple of things more brittle.

27:55.000 --> 28:02.000
If you have a nullfs, you know it's always empty, there's nothing in there, so it doesn't really matter if you can unmount it.

28:02.000 --> 28:07.000
So it can stay unlocked, and it makes a bunch of handling in the kernel easier as well.

28:07.000 --> 28:10.000
Okay, so there's a bunch more I could talk about.

28:10.000 --> 28:16.000
Hopefully I was understandable, and it was interesting.

28:16.000 --> 28:20.000
And I hope you have some interesting questions.

28:29.000 --> 28:34.000
How do we get the slides?

28:35.000 --> 28:40.000
I'm uploading them after the talk; I'm sorry, I probably forgot that.

28:40.000 --> 28:48.000
So usually at FOSDEM, you can get the slides for any of the talks directly from the schedule.

28:48.000 --> 28:53.000
And the nullfs stuff is the default, right? I don't have to opt in?

28:53.000 --> 28:59.000
And so the preference was — people wanted to make it unconditional.

28:59.000 --> 29:04.000
I also added the option to make it conditional.

29:04.000 --> 29:09.000
In case this causes any issues, we could make it a command line option.

29:09.000 --> 29:11.000
But we don't need to do anything in systemd to make it work?

29:11.000 --> 29:15.000
No, systemd does everything correctly, it pivot_roots.

29:15.000 --> 29:20.000
It tries pivot_root first, and if that doesn't work, it just does a move mount, but even if you don't do that,

29:20.000 --> 29:25.000
even if you say you always rely on the fact that you don't need anything at all,

29:25.000 --> 29:31.000
and you say "I can't pivot_root", then all that happens is that your rootfs has another parent mount,

29:31.000 --> 29:34.000
but it doesn't matter, you can't unmount it, you can't get to it.

29:34.000 --> 29:36.000
It's like irrelevant.

29:36.000 --> 29:38.000
Thank you.

29:43.000 --> 29:46.000
You're doing, like, a choose-your-own-adventure talk.

29:46.000 --> 29:49.000
I'll take the mic.

29:49.000 --> 29:52.000
No, I know.

29:52.000 --> 29:54.000
I know.

29:54.000 --> 29:56.000
Let's start.

