WEBVTT

00:00.000 --> 00:13.000
So, Raphaël is going to be talking about Rust in Mercurial, so please give them a welcome.

00:13.000 --> 00:23.000
Hello everyone, thank you. I'm going to do one clap because my last talk was out of sync.

00:23.000 --> 00:29.000
Thank you. Welcome. I'm going to be talking about the usage of Rust within Mercurial.

00:29.000 --> 00:38.000
Why we think using Rust is the right choice for us, and some of the lessons that you might take away from this for your projects.

00:38.000 --> 00:48.000
My name is Raphaël Gomès. I work at a small company called Octobus. We're a consulting company and we specialize in version control, and specifically in Mercurial.

00:48.000 --> 01:00.000
Mercurial is version control software. It appeared at the same time as Git: it was created in 2005, written at the time in Python with a little bit of C.

01:00.000 --> 01:10.000
For this room, it's had Rust extensions since 2018 and a pure Rust fast path since 2020.

01:11.000 --> 01:23.000
I actually gave a presentation on that fast path in 2020, here. Since then, the amount of Rust that we've had in our codebase has grown steadily.

01:23.000 --> 01:33.000
This is based off of lines of code, so it's not really the best metric ever, but it gives you a rough idea that Rust is now a pretty significant part of Mercurial.

01:33.000 --> 01:40.000
It's around half a million lines of code, all included. The C bits have basically not moved that much in the past 10 years,

01:40.000 --> 01:47.000
aside from some compatibility stuff, and Rust has kept growing.

01:47.000 --> 01:58.000
A version control system is kind of a weird database. I'm going to talk a little bit about the constraints within the project and why we think that Rust is a good fit.

01:58.000 --> 02:05.000
Specifically, because we do stuff that is highly specialized to the data that we are handling.

02:05.000 --> 02:13.000
One example of that is Mercurial's manifest. The manifest is a data structure that exists for each revision, for each commit.

02:13.000 --> 02:19.000
It holds the entire list of every file that this revision has. It's similar to a Git tree.

02:20.000 --> 02:25.000
The manifest has a compression ratio of up to 50,000 times.

02:25.000 --> 02:43.000
If you have a very, very large repository, like some private ones or, for example, the Mozilla one, it stores about one petabyte of completely uncompressed data in about 20 gigabytes of actual on-disk space.
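
NOTE
A quick check of the arithmetic behind that ratio, as a sketch. It assumes
decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes); both sizes are the
approximate figures quoted in the talk, not measurements.
```python
# Rough arithmetic only; both sizes are the approximate figures from the talk.
uncompressed_bytes = 10**15      # about one petabyte of uncompressed data
on_disk_bytes = 20 * 10**9       # about 20 gigabytes on disk
ratio = uncompressed_bytes // on_disk_bytes
print(ratio)  # 50000
```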

02:43.000 --> 02:59.000
We have a plan for a zero-copy version that has the exact same characteristics, meaning that we have a 50,000 times compression ratio, but we don't even have to allocate or do anything to handle it.

02:59.000 --> 03:07.000
This is a proof of concept, but the current implementation already has that property.

03:07.000 --> 03:26.000
Mercurial scales comfortably. We have another example of a large private repository. It has tens of millions of changesets, millions of files, everything fully replicated using vanilla Mercurial, and it's working fine.

03:26.000 --> 03:41.000
Those are kind of the scaling and the performance aspects of it, but why do we use Rust specifically for those on-disk data structures? Why couldn't we use Python?

03:41.000 --> 03:45.000
Because it turns out that Python isn't really great for binary access.

03:45.000 --> 03:54.000
Quick show of hands: who has heard about the struct module in Python? A decent fifth of the room, I would say.

03:54.000 --> 04:00.000
The struct module allows you to do unpacking and packing of binary data.

04:00.000 --> 04:11.000
This is an example of a node in one of our formats, and this means you have a big-endian something that has two bytes, then four, then four, whatever.

04:11.000 --> 04:19.000
So you define your format and what that gives you is a way of unpacking and packing binary data.
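
NOTE
A minimal sketch of what packing and unpacking with the struct module looks
like. The layout below (one 2-byte field, then two 4-byte fields) is a
hypothetical stand-in, not the actual node format shown on the slide.
```python
import struct
# Hypothetical layout, not Mercurial's real one: a 2-byte field, then two 4-byte fields.
# ">" selects big-endian byte order with no padding between fields.
NODE = struct.Struct(">HLL")
packed = NODE.pack(7, 1024, 42)    # Python values to bytes
fields = NODE.unpack(packed)       # bytes back to a tuple of Python ints
print(NODE.size, fields)  # 10 (7, 1024, 42)
```
Every call to unpack builds a fresh tuple, which is the cost discussed later in the talk.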

04:19.000 --> 04:31.000
So this example that I've just shown you is for a data structure called the dirstate, which corresponds to the working copy.

04:31.000 --> 04:37.000
This means whatever you're working on, your files or folders in your currently checked out revision.

04:37.000 --> 04:44.000
This is a zero-copy persistent tree, and it's useless in Python.

04:44.000 --> 04:57.000
It was written with Rust in mind, because something like this would not have been possible, or rather practical: we do have a Python implementation, but it's not practical for scaling.
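
NOTE
To illustrate the zero-copy idea, here is a sketch that reads a single field
of a single record straight from its offset in a buffer, without decoding the
rest. The record layout and the size_of helper are hypothetical; a real tool
would typically memory-map the file rather than build the bytes in memory.
```python
import struct
# Hypothetical record: a 2-byte kind and a 4-byte size, big-endian (not Mercurial's real format).
RECORD = struct.Struct(">HL")
# Stand-in for file contents; a real tool would typically mmap the file instead.
data = b"".join(RECORD.pack(kind, size) for kind, size in [(1, 10), (2, 20), (3, 30)])
view = memoryview(data)  # a view over the bytes, no copy of the payload
def size_of(index):
    # Decode only the size field of one record, straight from its offset.
    (size,) = struct.unpack_from(">L", view, index * RECORD.size + 2)
    return size
print(size_of(2))  # 30
```
Even here, each decoded value becomes a new Python object, which is part of why this approach pays off far less in Python than in Rust.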

04:57.000 --> 05:02.000
So I'm getting back to this specific example.

05:02.000 --> 05:07.000
This creates a 13-tuple for each node.

05:07.000 --> 05:11.000
We have millions of them, or at least hundreds of thousands of them.

05:11.000 --> 05:14.000
That means 14 objects per node.

05:14.000 --> 05:24.000
If you want to do anything with any amount of encapsulation and for your code to be somewhat usable, you will probably have a class instance per node.

05:24.000 --> 05:32.000
Every node you would put into a recursive dict, because you need parent and child relations, et cetera.

05:32.000 --> 05:36.000
That means a lot of allocation, that means garbage collection.
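
NOTE
A sketch of where the "14 objects per node" count comes from. The 13-field
layout is hypothetical, chosen only so that unpacking yields a 13-tuple.
sys.getsizeof is CPython-specific and reports the tuple alone.
```python
import struct
import sys
# Hypothetical 13-field big-endian layout, chosen only so that unpacking yields a 13-tuple.
NODE = struct.Struct(">" + "L" * 13)
raw = NODE.pack(*range(13))
fields = NODE.unpack(raw)
# One tuple plus one int object per field (CPython may share small ints between nodes).
objects_per_node = 1 + len(fields)
print(len(fields), objects_per_node)  # 13 14
# Size of the raw record versus the tuple alone, before counting the ints it refers to.
print(len(raw), sys.getsizeof(fields))
```
A class instance per node and a dict entry per node come on top of that.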

05:36.000 --> 05:43.000
So using such a data structure is not practical using Python, but what about the C extensions, you might say.

05:43.000 --> 05:50.000
Maybe not in this room that much, but historically it's interesting to understand what happens.

05:50.000 --> 06:00.000
Cross-platform, cross-OS work is harder, because you have to do more manual things than you would in Python, which is pretty good at cross-platform compatibility.

06:00.000 --> 06:05.000
At some point you will still need to interact with the huge Python code base that we have.

06:05.000 --> 06:15.000
So it means that your limits have to go up into a higher-level abstraction, and you have to write that in C.

06:15.000 --> 06:26.000
And it turns out that a C rewrite is pretty much out of the question, because contributors have felt uneasy in general writing high-level logic in C.

06:26.000 --> 06:33.000
And this is not to say that we haven't had very competent C programmers within the project.

06:33.000 --> 06:44.000
The creator of the project herself, Olivia Mackall, did a whole bunch of stuff in the Linux kernel, and she was, and probably still is, a very competent C programmer.

06:44.000 --> 06:59.000
But she said, at the initial phase of Mercurial in 2005 and 2006, that she chose Python because she was able to do things in two weeks that would have taken her months in C, and so she would just not have done it.

06:59.000 --> 07:04.000
So for us, we've established that data access is fundamental.

07:04.000 --> 07:17.000
And this is where Rust really fits into the picture. Rust helps us build on top of the data access, which is the most crucial part for us.

07:17.000 --> 07:20.000
This is generally my problem.

07:20.000 --> 07:30.000
We have a giant code base, and what my head is able to fit all at once, that I can keep inside of my context, is actually a pretty small part of the code base.

07:30.000 --> 07:35.000
Maybe that's a me problem, but I think that's a general problem for most people.

07:35.000 --> 07:46.000
There's no way that you can have the entire code base in context, which means that you need your tooling to help you do local reasoning, thinking about your code locally.

07:46.000 --> 08:11.000
And to some extent, this is not specific to Rust. You could be using another programming language like OCaml, or whatever other programming language that has a strong enough type system that it can enforce invariants of whatever you're trying to build and say, this state is not possible, and it's not possible at compile time, and it helps you define all of those rules.

08:11.000 --> 08:20.000
But you may have heard about something very specific to Rust, which is the borrow checker, for aliasing and parallelization.

08:20.000 --> 08:40.000
This is also, like, a layer on top of the type system, and the type system is built in a quite careful way, picking out things that other languages have done and doing a very good mix of a whole bunch of very old ideas, some a little bit newer, and one specific novel idea, which is the borrow checker.

08:40.000 --> 09:04.000
It allows you to think very locally: I can look at a diff during code review, and I can say, oh, you're touching this line or using this particular struct, and it's very easy to understand the implications without having to get the entire code base within your head, or at least the entire space within that code base.

09:05.000 --> 09:18.000
For me, that means that refactoring is easier. It means that code review is much easier, especially for complex systems. I'm not talking about small scripts, I'm really talking about larger projects.

09:18.000 --> 09:36.000
And this is kind of the bonus: profiling and tracing is easier, much easier than in Python. I don't think Rust has any advantage over other compiled languages there, though the tracing ecosystem is actually quite nice, and if any of you have contributed to tracing, thanks.

09:36.000 --> 09:50.000
So, let's talk about the ecosystem. Rust helps you, like, the language helps you, but the language is kind of linked together with its ecosystem.

09:50.000 --> 10:13.000
Cargo was built after many other projects had tried different ideas of how you build a package manager. Those projects maybe started when access to the internet was not a given, for example, or when the way that you would distribute libraries was by sending them via email or

10:14.000 --> 10:26.000
It was a very different time. Cargo had the luxury of building itself on top of the corpses of other package managers, some of which are still alive somehow.

10:26.000 --> 10:38.000
And thankfully this helps us get a whole bunch of problems out of the way. It has quite good features like yanking.

10:38.000 --> 10:48.000
Does anybody not know what yanking is in the context of Cargo? Okay, two people. That's cool.

10:48.000 --> 10:59.000
So yanking means you have a bad version of a crate that you've published and you want to remove that and say, please don't install this crate.

10:59.000 --> 11:04.000
It's not a please. It's like, don't install this crate. It has some problem, some security issue, whatever.

11:04.000 --> 11:12.000
And so Cargo, by default, will not pull a yanked crate and will refuse to install it unless you specifically do something with the lock file.

11:12.000 --> 11:22.000
Speaking of, you have a lock file which allows you to pin your dependencies and which now we take for granted and everybody's like, yeah, of course you have a lock file.

11:22.000 --> 11:26.000
But that really was not that clear not that long ago.

11:26.000 --> 11:33.000
Also in general, and that might just, I mean, everything is my opinion, but this is maybe even more my opinion.

11:33.000 --> 11:41.000
I think in general, the ecosystem is of higher quality than most others that I've seen.

11:41.000 --> 11:52.000
Part of that is that it's a newer language, it's a newer ecosystem, so all of the individual libraries and projects can build on top of the mistakes that others have made.

11:52.000 --> 11:56.000
So it's an easier thing, but also I feel like there's a shift.

11:56.000 --> 12:03.000
Whenever you write something that you want to publish in the Rust world, the first question that everybody's going to ask is, is it blazingly fast and blah blah blah?

12:03.000 --> 12:13.000
And so you have some amount of pressure to build something that follows the latest research trends that has a nice API, et cetera.

12:13.000 --> 12:18.000
There's more scrutiny, so I think the ecosystem in general has higher quality.

12:19.000 --> 12:33.000
So when, for example, we improved the way that we do diffing in Mercurial, we just took a random diffing library and it just worked, and we didn't have to think too hard about it because its performance characteristics were just fine.

12:33.000 --> 12:37.000
And that's, I think, pretty nice.

12:37.000 --> 12:45.000
In general, the ecosystem is focused on performance, which for us is very important, but that's not the only reason.

12:45.000 --> 12:50.000
I want to give a quick shout-out to diff.rs.

12:50.000 --> 12:52.000
Who knows about this website?

12:52.000 --> 12:54.000
Okay.

12:54.000 --> 12:56.000
It's a very simple idea.

12:56.000 --> 12:59.000
You can do diffing across crate versions.

12:59.000 --> 13:05.000
You can just say, oh, I want to update my dependencies, which I do every start of the cycle.

13:05.000 --> 13:10.000
What is the actual source code difference between the crates that have been published?

13:10.000 --> 13:20.000
This is, for me, very useful to vet that the dependencies are doing what they say they are doing on the tin.

13:20.000 --> 13:31.000
I think in a more general sense, Rust has changed the way that we act on the project.

13:31.000 --> 13:39.000
There were many times in the Mercurial project where certain solutions were not pursued,

13:39.000 --> 13:46.000
because the implementation was too scary, or the model didn't fit the reality.

13:46.000 --> 13:53.000
And because basically, whenever you wanted to do something in Python, you had to say, yeah, but it has to go into a dict after that.

13:53.000 --> 13:59.000
So it kind of misses the point. And people didn't want to write complex C stuff.

13:59.000 --> 14:07.000
So using Rust means that we are now not scared of advanced ideas for complex problems.

14:07.000 --> 14:14.000
Some of the problems that we have have an inherent complexity: you cannot reduce the problem further.

14:14.000 --> 14:16.000
It's just complex in itself.

14:16.000 --> 14:22.000
So having a simple solution is nice, but sometimes there is no actual simple solution.

14:22.000 --> 14:26.000
You have to have a solution that matches the complexity of the problem.

14:26.000 --> 14:32.000
And before that, it was not as obvious for us in that sense.

14:32.000 --> 14:39.000
And Rust enables us to simply think about the problem and say, this is possible.

14:39.000 --> 14:41.000
An example of that.

14:41.000 --> 14:56.000
Seven-something years ago, there was a contributor who talked on the mailing list about a proof of concept that they had done to speed up the status command.

14:56.000 --> 15:00.000
That's when you ask, what are the changes in my working copy?

15:00.000 --> 15:03.000
What are the files that have changed, et cetera?

15:03.000 --> 15:11.000
Because status was slow, and they said, oh, I have a pure Rust version and it does some number of tens of milliseconds for some number of files.

15:11.000 --> 15:17.000
I don't exactly remember, but I remember that someone told them, oh, this performance is impossible.

15:17.000 --> 15:24.000
It was somebody quite high up at a large company who said, no, this is not possible.

15:24.000 --> 15:28.000
Even though they had access to the code and could test it themselves.

15:28.000 --> 15:36.000
And what happened is we upstreamed that, or a better version of that, that was actually four times faster than the proof of concept.

15:36.000 --> 15:45.000
Meaning that we ended up with something, if you have a pretty beefy machine, you can do a hundred milliseconds of status for about a million files,

15:45.000 --> 15:58.000
which is about the time it takes us to start doing anything in Python, plus the very small number of modules that we import.

15:58.000 --> 16:04.000
By that point, we've already lost and we've done the status for one million files on the exact same machine.

16:04.000 --> 16:07.000
This is about a hundred milliseconds for both.
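
NOTE
To put those figures in perspective, here is the per-file time budget they
imply. This is plain arithmetic on the approximate numbers quoted in the
talk, not a benchmark.
```python
# Per-file time budget implied by the approximate figures quoted in the talk.
total_nanoseconds = 100 * 1_000_000   # about a hundred milliseconds, in nanoseconds
file_count = 1_000_000                # about a million files
nanoseconds_per_file = total_nanoseconds // file_count
print(nanoseconds_per_file)  # 100
```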

16:08.000 --> 16:17.000
By the way, this is all without fsmonitor. It's all in process, it's just pure, like, boot of Rust: exec a Rust program.

16:17.000 --> 16:24.000
Do the status, give back the answer, close the status, this is all done in a hundred milliseconds.

16:24.000 --> 16:30.000
Some of you may have heard this expression, fall into the pit of success.

16:30.000 --> 16:33.000
Look at how happy they are.

16:33.000 --> 16:45.000
For those of you who may not be familiar, falling into the pit of success means that if you have a well-designed system, it's easy to do the right things and it's annoying to do the wrong things.

16:45.000 --> 16:53.000
And I think somehow Rust, and its ecosystem, help us fall a little bit more into the pit of success.

16:53.000 --> 16:58.000
So let me recap, why Rust?

16:59.000 --> 17:02.000
Optimal data structures, and optimal algorithms.

17:02.000 --> 17:12.000
This means that we can go from a research paper, which we do work on research papers with people whose entire job is graph theory and that kind of stuff.

17:12.000 --> 17:22.000
Go from the model of the research paper and say, I know how to implement it and I'm not scared of doing that and maintaining that afterwards.

17:22.000 --> 17:36.000
Because it's not just a matter of writing some fancy stuff so you can brag about it. It's also that you know that you've been maintaining 20 years of code and that you will likely be maintaining it for another 20.

17:36.000 --> 17:49.000
And that is not scary anymore because of all of what Rust has that is unique and that it's aligning a whole bunch of features that are good in many languages in a very nice way.

17:49.000 --> 17:54.000
It's also providing some of its own specificity.

17:54.000 --> 18:00.000
This means that it's unlocking previously impractical designs, which is kind of what I've been talking about this whole time.

18:00.000 --> 18:14.000
And maybe, you know, there's the fact that if you build a higher quality code base and a higher quality project then it kind of pushes you in the way that everything else has to become higher quality as well.

18:14.000 --> 18:22.000
Sometimes we've rewritten stuff to prepare for a Rust extension, and the Python got a lot better because the architecture was better.

18:22.000 --> 18:28.000
And nobody had ever really looked at it because everybody was like, oh, it's Python, it's slow. Not always.

18:28.000 --> 18:34.000
Python can be decently fast if you have the right data and are doing the right things.

18:34.000 --> 18:41.000
It's just that it shifted the mentality for the entire project.

18:42.000 --> 18:46.000
And in general, I think our contributors prefer Rust over Python.

18:46.000 --> 18:49.000
Not all of them mind you.

18:49.000 --> 18:58.000
But the people that are thinking hard about the problems, that make Mercurial scale, that make it move into the next generation of version control,

18:58.000 --> 19:06.000
Prefer to do it in Python and Rust, and not in Python and C written by kernel developers 20 years ago, somehow.

19:06.000 --> 19:09.000
So yeah, I think this is it for me.

19:10.000 --> 19:13.000
Thank you and feel free to come talk to me after this.

19:28.000 --> 19:29.000
Yes.

19:32.000 --> 19:34.000
I'm assuming they're speechless.

19:40.000 --> 19:57.000
What does interfacing between Python and Rust look like in the code base, sort of... Sorry, could you speak up? Like, it's all shooting this way.

19:57.000 --> 20:08.000
What does it look like, interfacing between the Python and the Rust? Because one of the problems that you mentioned with C is that you've still got this sort of translation between the data structures and what you see in Python.

20:09.000 --> 20:14.000
So how are you sort of eliminating that class of problems using Rust?

20:14.000 --> 20:23.000
Yeah, so one part of the problem is how hard it is to write FFI code, like the glue code, and it's easier thanks to the ecosystem, like PyO3.

20:23.000 --> 20:30.000
We were using rust-cpython first, and then we moved to PyO3 when it fit our needs.

20:31.000 --> 20:40.000
It helps make that a lot easier, but also, what I was saying is that at some point you need to move, like, one layer up. For the status work,

20:40.000 --> 20:49.000
I actually started doing the status in Rust, but there was a lot of back and forth, like one back and forth per file, and that was absolutely not scalable.

20:49.000 --> 20:54.000
So you kind of have to raise the bar; that's the only way that you can do it in a performant manner.
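
NOTE
A toy model of the difference between one round trip per file and one
round trip for the whole batch. NativeSide is a hypothetical stand-in
written in plain Python; it only counts how often the boundary is crossed
and is not a real extension module.
```python
# Toy model: NativeSide stands in for code on the Rust side; it is not a real extension.
class NativeSide:
    def __init__(self):
        self.crossings = 0  # how many times Python called into "native" code
    def check_one(self, path):
        self.crossings += 1
        return path.endswith(".rs")
    def check_many(self, paths):
        self.crossings += 1  # a single crossing, the loop runs on the native side
        return [p.endswith(".rs") for p in paths]
paths = [f"src/file{i}.rs" for i in range(1000)]
per_file = NativeSide()
results_a = [per_file.check_one(p) for p in paths]
batched = NativeSide()
results_b = batched.check_many(paths)
print(per_file.crossings, batched.crossings)  # 1000 1
```
Both variants return the same answers; only the number of crossings differs, which is the cost that moving one layer up removes.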

20:54.000 --> 20:57.000
There you go.

20:57.000 --> 21:04.000
Yeah, so you talked about rhg status going super fast.

21:04.000 --> 21:19.000
So the code in Rust, is it vertical, command by command, or is it horizontal, if you were to draw it out, maybe, where it's doing the file system stuff and then the Python builds on top of it? How...

21:19.000 --> 21:24.000
So the rhg stuff specifically does not start Python. This is pure Rust.

21:24.000 --> 21:30.000
Yeah, but when you have the, I think it's about 15% Rust in the project.

21:30.000 --> 21:31.000
Yes.

21:31.000 --> 21:34.000
Is it like horizontal layers, or is it command by...

21:34.000 --> 21:46.000
Yes, so we have kind of two different bits of Rust in the project. One of them is the pure Rust version, which is kind of in its own tree, on its own side; it's independent from the Python side.

21:46.000 --> 21:51.000
And it plugs into a shared library which is then hooked into Python.

21:51.000 --> 22:01.000
So some of that is used just as Python extensions that look like normal Python except that they're written in Rust, and other stuff is completely pure Rust.

22:01.000 --> 22:14.000
So for status, we have both a Python version that calls into the status and comes back with an answer, and a pure Rust version that uses the same code.

22:14.000 --> 22:15.000
There.

22:15.000 --> 22:16.000
Great.

22:16.000 --> 22:17.000
Thanks.

22:17.000 --> 22:18.000
Anyone else?

22:18.000 --> 22:19.000
No.

22:19.000 --> 22:20.000
Yeah.

22:20.000 --> 22:32.000
Will the Rust version become the default one for status, for example, or what is the path for making that...

22:32.000 --> 22:35.000
I just didn't catch the very first part.

22:35.000 --> 22:43.000
Will rhg status become the default at some point, or how do you do that?

22:44.000 --> 22:47.000
The point right now is that rhg is transparent.

22:47.000 --> 22:53.000
It looks like hg, and if it doesn't know how to do something, it will fall back to hg.

22:53.000 --> 22:57.000
So it's kind of like, you don't have to ask the user to migrate.
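
NOTE
The transparent fallback described here can be sketched as a fast path that
gives up on anything it does not support and hands over to the full
implementation. All names below are hypothetical; this is not rhg's actual
code, command set, or configuration.
```python
# Hypothetical names throughout; this is not rhg's actual code or command set.
class Unsupported(Exception):
    pass
def fast_path(args):
    # Pretend the fast implementation only knows "status" without extra flags.
    if args != ["status"]:
        raise Unsupported(args)
    return "fast:status"
def full_implementation(args):
    return "full:" + " ".join(args)
def run(args):
    try:
        return fast_path(args)
    except Unsupported:
        return full_implementation(args)  # transparent fallback, same interface
print(run(["status"]))          # fast:status
print(run(["log", "--graph"]))  # full:log --graph
```
The caller only ever sees run, so nothing changes for the user as more commands gain a fast path.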

22:57.000 --> 23:05.000
But at some point we will have to build, like, a new way of interacting with Mercurial that is not the hg command line.

23:05.000 --> 23:08.000
But that was very much out of scope for this talk.

23:08.000 --> 23:09.000
Yes.

23:09.000 --> 23:15.000
Thank you.

23:15.000 --> 23:18.000
If there are no other questions, thank you very much.

23:18.000 --> 23:19.000
Thank you.

23:19.000 --> 23:29.000
Thank you.

