WEBVTT

00:00.000 --> 00:12.000
Hello everyone, I would like to introduce our next speaker, Patrick Steinhardt.

00:12.000 --> 00:23.400
Thank you, so good morning everyone and welcome to my talk, Evolving Git for the next

00:23.400 --> 00:25.200
decade.

00:25.200 --> 00:29.400
Let me first introduce myself, so my name is Patrick Steinhardt.

00:29.400 --> 00:36.080
My interest in open-source software started around 2002, when I was 11 years old.

00:36.080 --> 00:40.560
My elementary school teacher back then was very big into computers, so we spent almost

00:40.560 --> 00:43.680
every week in a well equipped computer room.

00:43.680 --> 00:47.960
One of the things that my teacher introduced me to back then was this magical thing called

00:47.960 --> 00:48.960
Linux.

00:48.960 --> 00:51.440
Linux really got me hooked.

00:51.440 --> 00:56.240
You could play around with it, make it do weird stuff, and break the whole computer while

00:56.240 --> 01:01.480
trying to make your windows burn down when you close them, or just to get a fancy 3D cube

01:01.480 --> 01:05.080
if you, for example, want to have virtual workspaces.

01:05.080 --> 01:10.240
My father, on the other hand, wasn't that happy, because we had frequent arguments around

01:10.240 --> 01:15.360
why I wiped the computer once again, or why the internet doesn't work.

01:15.360 --> 01:20.320
That being said, it eventually kickstarted my interest in software engineering.

01:20.320 --> 01:26.000
I bought my first book about programming when I was 12 years old, and eventually started

01:26.000 --> 01:32.640
to do some small contributions to open-source software in 2011.

01:32.640 --> 01:37.440
In 2015, my involvement with open-source software development changed significantly when I found

01:37.440 --> 01:42.800
a job posting that was about contributing to an open-source version control system.

01:42.800 --> 01:44.680
The deal was rather simple.

01:44.680 --> 01:49.120
I had to do something related to version control systems, and that knowledge that I gained

01:49.120 --> 01:53.560
was then sold to customers by doing trainings and consulting.

01:53.560 --> 01:58.360
I had a free choice of which software project I wanted to contribute to, and that's

01:58.360 --> 02:02.800
exactly what I wanted to do there, which was awesome.

02:02.800 --> 02:07.600
For me, the choice was basically between Subversion and Git, and I was intrigued

02:07.600 --> 02:10.280
by both ecosystems.

02:10.280 --> 02:14.400
To be honest, I had just come from a job where I had to use Subversion, so you might understand

02:14.400 --> 02:19.320
why I didn't want to have anything to do with Subversion in the first place.

02:19.320 --> 02:23.520
I instead chose Git, and that's how I eventually became one of the core contributors

02:23.520 --> 02:27.920
to both Git and libgit2.

02:27.920 --> 02:33.960
In 2020, I then switched to GitLab as a backend engineer, where I worked on Gitaly.

02:33.960 --> 02:40.320
Gitaly is the RPC service that sits between GitLab and your Git repositories.

02:40.320 --> 02:44.640
Eventually, my responsibility shifted a little bit, towards contributing to Git upstream

02:44.640 --> 02:45.640
only.

02:45.640 --> 02:49.960
We faced multiple performance bottlenecks that we wanted fixed, and fixing those

02:49.960 --> 02:53.800
required significant long-term investments into Git.

02:53.800 --> 02:57.840
One of these efforts was, for example, the reftable backend, which I've been talking about

02:57.840 --> 03:02.000
way too much over the last couple of years, and this somehow also gave me the nickname

03:02.000 --> 03:07.520
the Reftable Guy, in some contexts.

03:07.520 --> 03:08.520
Git is done.

03:08.520 --> 03:10.520
What is there to change?

03:10.520 --> 03:15.400
This is something that I heard quite often over the last year, and I get it.

03:15.400 --> 03:18.600
Just last year, Git turned 20 years old.

03:18.600 --> 03:19.600
Everyone uses it.

03:19.600 --> 03:24.920
It works, so why mess with success?

03:24.920 --> 03:28.560
The success of git is indeed quite staggering.

03:28.560 --> 03:32.920
94% of all the developers out there are using it day-to-day, and there are hundreds of

03:32.920 --> 03:38.200
millions of Git repositories out there, and many more scripts depending on it.

03:38.200 --> 03:45.080
It is safe to say that Git is everywhere nowadays in the world of software development.

03:45.080 --> 03:48.520
So that begs the question: is Git really done?

03:48.520 --> 03:53.440
Is it the perfect version control system that doesn't require any changes anymore?

03:53.440 --> 03:59.720
Well, it might not surprise you, but for me the answer is a definite no.

03:59.720 --> 04:05.240
The world has changed quite significantly since 2005.

04:05.240 --> 04:08.520
Git was designed for a different era.

04:08.520 --> 04:13.720
In 2005, SHA-1 was considered to be a secure hash function, but that has changed with

04:13.800 --> 04:17.400
the SHAttered attack and other attacks on SHA-1.

04:17.400 --> 04:21.800
Back then, the Linux kernel was considered a rather large repository.

04:21.800 --> 04:25.720
Nowadays, it's dwarfed by repos like the Chromium repository, which is almost a hundred

04:25.720 --> 04:28.480
gigabyte in size.

04:28.480 --> 04:32.400
CI systems were kind of the exception, and you could count yourself lucky if you

04:32.400 --> 04:36.120
had, for example, a Jenkins instance available to you.

04:36.120 --> 04:40.720
Nowadays, we frequently see projects with crazy huge CI pipelines, where every single

04:40.800 --> 04:44.440
commit kicks off thousands of jobs.

04:44.440 --> 04:49.240
Monorepos is a term that nobody had heard about back then, but nowadays, everyone

04:49.240 --> 04:51.600
is using them.

04:51.600 --> 04:57.840
And also, Git was very hard to use back then, but to be quite honest, it's still hard to

04:57.840 --> 05:06.040
use nowadays.

05:06.040 --> 05:11.440
So the world has changed, and that makes it clear that Git needs to change as well.

05:11.440 --> 05:15.280
But the unique position of Git means that we cannot have a revolution.

05:15.280 --> 05:19.640
There are millions of developers out there, and many projects that rely on Git.

05:19.640 --> 05:24.680
So we must make sure to not break the world; we simply can't break the world.

05:24.680 --> 05:30.840
Too many people depend on Git, but we can make it better, one step at a time.

05:30.840 --> 05:33.560
This evolution is what I want to talk about today.

05:33.560 --> 05:37.080
I want to highlight a couple of important transitions that Git is currently going through

05:37.080 --> 05:43.960
to ensure that the product stays relevant as the world keeps on changing.

05:43.960 --> 05:46.680
So let's dive right into our first topic.

05:46.680 --> 05:51.520
This is one of the most user-visible changes that is currently happening in the

05:51.520 --> 05:56.560
Git world: the SHA-256 transition.

05:56.560 --> 05:59.120
SHA-1 is a central part of Git's design.

05:59.120 --> 06:03.320
Every single object has an identity, and that identity is computed by hashing the

06:03.360 --> 06:05.200
contents of the object.

06:05.200 --> 06:10.080
So for blobs, we hash the file contents, for trees, we hash the directory structure,

06:10.080 --> 06:15.880
and for commits we hash authorship information, commit message, and the root tree.

06:15.880 --> 06:18.560
Git objects are said to be content-addressable.

06:18.560 --> 06:22.040
Given the contents, you know the name of the object.

06:22.040 --> 06:26.200
The result is that you have implicit integrity verification for your objects.

06:26.200 --> 06:31.400
You can easily deduplicate them, and the history becomes immutable.

06:31.400 --> 06:34.440
This object name is computed by using SHA-1.

06:34.440 --> 06:38.040
If you have, for example, a blob that contains the string "hello world", you only need

06:38.040 --> 06:42.640
to prefix the object header, and then compute the SHA-1 to get the object name.

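The hashing scheme just described can be reproduced on the command line; the following is a minimal sketch using an example "hello world" blob:

```shell
# Git derives an object name by hashing a header plus the contents.
# For a blob holding the 11-byte string "hello world":
printf 'hello world' | git hash-object --stdin

# This is equivalent to prefixing the header "blob 11\0" yourself
# and computing the SHA-1 over the result:
printf 'blob 11\0hello world' | sha1sum
# both print 95d09f2b10159347eece71399a7e2e907ea3df4f
```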
06:42.640 --> 06:44.800
There's one problem though.

06:44.800 --> 06:48.920
SHA-1 is not a secure hash function anymore.

06:48.920 --> 06:55.440
In 2017, Google and the CWI Amsterdam Research Institute proved that known theoretical attacks

06:55.440 --> 06:59.520
on SHA-1 are viable in practice with the SHAttered attack.

06:59.520 --> 07:04.200
The result of this partnership is two syntactically valid PDF files that both result

07:04.200 --> 07:06.560
in the same SHA-1 hash.

07:06.560 --> 07:15.040
The attack requires around nine quintillion SHA-1 computations, which corresponds to 110 years of single-GPU computation.

07:15.040 --> 07:19.200
This may seem like a lot, but if you for example wanted to brute-force SHA-1 instead,

07:19.200 --> 07:23.360
a hash collision would require 12 million GPU-years to compute.

07:23.360 --> 07:26.640
So quite broken.

07:26.640 --> 07:31.440
Also, you can imagine that with all the recent hype that we have around artificial intelligence,

07:31.440 --> 07:36.920
data centers have increasingly expanded their GPU capacity.

07:36.920 --> 07:41.720
So nowadays, it is very much in reach for a large player to compute hash collisions at

07:41.720 --> 07:45.080
will, if they want to.

07:45.080 --> 07:50.200
Of course, as Git heavily relies on SHA-1, the SHAttered attack has kicked off a huge

07:50.200 --> 07:53.200
and intense discussion on the Git mailing list.

07:53.280 --> 07:57.400
It has been asserted since the beginning that the use of SHA-1 is not primarily for security

07:57.400 --> 07:58.400
though.

07:58.400 --> 08:02.920
There are a couple of arguments that are made in this context.

08:02.920 --> 08:07.880
First, the object hash is mostly used as an integrity check to detect corruption, bit

08:07.880 --> 08:10.640
flips, and transmission errors.

08:10.640 --> 08:13.200
Also, source code is transparent.

08:13.200 --> 08:17.320
If you see a merge request, for example, where somebody inserts random collision data into

08:17.320 --> 08:22.320
your source code, then you would probably ask some questions.

08:22.320 --> 08:28.320
Also, the object format that Git uses adds some protection, because we prepend the object

08:28.320 --> 08:29.880
length to the object.

08:29.880 --> 08:35.680
This means that you cannot just append collision data to an object.

08:35.680 --> 08:40.280
And last but not least, there are also other security measures, like GPG signatures and

08:40.280 --> 08:45.840
encrypted transports, that add a web of trust between developers.

08:45.840 --> 08:50.360
But the reality is that things are a little bit more complicated.

08:50.360 --> 08:54.920
Yes, you can use GPG signatures to sign your commits, but unfortunately, that signature

08:54.920 --> 08:57.480
is on the SHA-1 commit hash.

08:57.480 --> 09:04.920
Consequently, if you can create a collision, you cannot trust GPG signatures at all anymore.

09:04.920 --> 09:06.960
Also, not everything is source code.

09:06.960 --> 09:11.400
Whether you like it or not, some repositories out there contain binary blobs like firmware

09:11.400 --> 09:13.800
or compiled assets, for example.

09:13.800 --> 09:19.640
It's almost impossible to verify whether those might contain crafted collision data,

09:19.640 --> 09:23.760
because they are not human-readable.

09:23.760 --> 09:27.720
Also, a lot of our modern tooling builds its trust on top of Git commit hashes.

09:27.720 --> 09:34.640
You see this in CI/CD, in scripts that interact with Git, in programming languages that perform

09:34.640 --> 09:37.080
dependency pinning, and so on.

09:37.080 --> 09:42.360
In many of those use cases, we implicitly trust the Git commit hash.

09:42.360 --> 09:48.800
And finally, government and enterprise policies also mandate the removal of SHA-1 by 2030

09:48.800 --> 09:52.400
in favor of more secure hash functions.

09:52.400 --> 09:57.320
So overall, it's safe to say that even if Git itself does not rely on SHA-1 for security,

09:57.320 --> 10:01.840
the ecosystem very much does.

10:01.840 --> 10:04.080
The fix for this is of course quite obvious.

10:04.080 --> 10:07.840
Let's just swap out SHA-1 and replace it with a different hash function.

10:07.840 --> 10:09.920
And that's exactly what happened.

10:09.920 --> 10:16.000
In Git 2.29, which was published in October 2020, we added support for using

10:16.000 --> 10:18.000
SHA-256 instead.

10:18.000 --> 10:23.720
You can simply create a new repository by running git init --object-format=sha256,

10:23.720 --> 10:27.880
and then you get to use a different hash function.

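The command just mentioned can be tried directly; here is a minimal sketch (the repository name is arbitrary, and Git 2.29 or newer is assumed):

```shell
# Create a repository whose objects are named by SHA-256 (Git >= 2.29):
git init --object-format=sha256 sha256-demo
cd sha256-demo

# The repository records which hash function it uses:
git rev-parse --show-object-format
# prints: sha256

# Object IDs are now 64 hex characters long instead of 40:
printf 'hello world' | git hash-object --stdin
```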
10:27.880 --> 10:30.480
The code is there, it works.

10:30.480 --> 10:33.440
You can use it today for your new repositories.

10:33.440 --> 10:36.640
But somehow, nobody out there is using SHA-256.

10:36.640 --> 10:40.160
So what is taking us so long?

10:40.160 --> 10:43.440
The problem is that Git has very strong network effects.

10:43.440 --> 10:48.560
You cannot just implement SHA-256 in the Git command line and then call it a day,

10:48.560 --> 10:51.720
because we also need to consider the ecosystem.

10:51.720 --> 10:55.320
But unfortunately, the situation looks somewhat grim here.

10:55.320 --> 11:00.960
Next to Git, there's only a single forge and a single library that fully support SHA-256.

11:00.960 --> 11:05.520
GitLab and a few other libraries also have experimental SHA-256 support.

11:05.520 --> 11:10.880
But some, well, rather insignificant players, like for example GitHub, don't support SHA-256

11:10.880 --> 11:13.280
at all.

11:13.280 --> 11:15.680
This creates a chicken-and-egg problem.

11:15.680 --> 11:20.840
Nobody's moving to SHA-256 because it is not supported by large forges, but large forges

11:20.840 --> 11:25.280
don't implement support because there's no demand.

11:25.280 --> 11:27.600
The problem is that we cannot wait forever.

11:27.600 --> 11:31.160
It will become more and more feasible over time to break SHA-1.

11:31.160 --> 11:35.720
And the next cryptographic weakness may be just around the corner.

11:35.720 --> 11:40.040
We need to consider that even if we had full support for SHA-256, projects still need

11:40.040 --> 11:42.160
time to migrate.

11:42.160 --> 11:45.800
So that's why we need to break the cycle.

11:45.800 --> 11:50.520
The Git project has decided to make SHA-256 the default hash for newly created repositories

11:50.520 --> 11:52.400
in Git 3.0.

11:52.400 --> 11:57.680
Our hope is that by making SHA-256 the default hash function, we are forcing both forges

11:57.680 --> 12:00.720
and third-party implementations to adapt.

12:00.720 --> 12:07.320
The message is clear: SHA-256 is the future, get ready for it.

12:07.320 --> 12:10.960
This transition will likely not be an easy one, and it may result in a couple

12:11.000 --> 12:13.000
of hiccups along the road.

12:13.000 --> 12:15.200
But if you're interested, you can also help.

12:15.200 --> 12:19.040
You can start playing around with the SHA-256 backend and create repos with it.

12:19.040 --> 12:24.280
You can show your favorite code forges that you care about SHA-256 so that they

12:24.280 --> 12:30.200
bump the priority, and you can even try and help third-party tools that depend on Git

12:30.200 --> 12:33.600
by adding SHA-256 support.

12:33.600 --> 12:37.600
Together, we can hopefully get the ecosystem to move before the next vulnerability in

12:37.600 --> 12:38.600
SHA-1.

12:42.600 --> 12:46.560
I need to live up to my nickname, the Reftable Guy, and talk a little bit about my favorite

12:46.560 --> 12:49.080
topic, which is reftables.

12:49.080 --> 12:53.200
Reftables are another significant shift that is currently happening in Git repositories,

12:53.200 --> 12:57.160
as a new backend to store your references.

12:57.160 --> 13:00.400
Before we talk about RefTables, though, I first want to give you a little bit of context

13:00.400 --> 13:06.040
about how references are stored in Git right now, and why that is a problem.

13:06.040 --> 13:10.640
When creating or updating references, they are by default stored in the loose format.

13:10.640 --> 13:13.520
Every reference is stored as a separate file.

13:13.520 --> 13:16.560
The reference format is really easy to understand.

13:16.560 --> 13:22.360
To demonstrate, let's create a simple repository with a commit and a branch.

13:22.360 --> 13:24.760
The first file we will examine is HEAD.

13:24.760 --> 13:27.480
This file indicates what your currently checked-out branch is.

13:27.480 --> 13:33.560
As you can see, it has a "ref:" prefix, which means that this is a symbolic reference,

13:33.560 --> 13:35.800
where the target is refs/heads/main.

13:35.800 --> 13:40.400
So we know we have the main branch checked out in that repo.

13:40.400 --> 13:44.200
All the other references are stored in the refs/ directory hierarchy.

13:44.200 --> 13:46.560
As we can see, we got two files in there.

13:46.560 --> 13:49.000
refs/heads/feature and refs/heads/main.

13:49.000 --> 13:56.760
These are our branches, and they contain the object ID that they are pointing to as their contents.

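The layout described above can be inspected in any repository; here is a sketch (the branch names and the inline committer identity are just for the example):

```shell
# Set up a tiny repository with one commit and one extra branch.
git init -b main demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "initial commit"
git branch feature

cat .git/HEAD               # ref: refs/heads/main  (a symbolic ref)
ls .git/refs/heads          # feature  main
cat .git/refs/heads/main    # object ID of the commit the branch points to
```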
13:56.760 --> 14:00.720
Now storing every single reference as a separate file works well when your repository only

14:00.720 --> 14:02.760
has a handful of them.

14:02.760 --> 14:05.720
But when you have hundreds or even thousands of refs, then it becomes

14:05.720 --> 14:07.400
really inefficient.

14:07.400 --> 14:11.520
Every reference needs a separate inode, which typically also means that it needs a full

14:11.520 --> 14:13.560
disk sector.

14:13.560 --> 14:17.200
Listing all your references also becomes increasingly expensive, as you have to read many

14:17.200 --> 14:19.320
many files.

14:19.320 --> 14:22.960
So Git regularly packs your references to counteract this.

14:22.960 --> 14:26.720
Instead of storing each reference in a separate file, they get packed into a packed-refs

14:26.720 --> 14:28.840
file.

14:28.840 --> 14:32.640
You can manually pack references by running git pack-refs --all.

14:32.640 --> 14:37.080
You typically don't have to do that yourself because Git does it automatically

14:37.080 --> 14:39.280
for you.

14:39.280 --> 14:44.040
As you can see after executing this command, our loose references have gone.

14:44.040 --> 14:48.560
Instead, we now have a single packed-refs file that contains a sorted list of all the

14:48.560 --> 14:52.320
references that have just been packed.

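Packing can be triggered by hand to observe this effect; the following sketch builds a throwaway repository first (names and identity are only for the example):

```shell
# Start from a repository with a couple of branches ...
git init -b main pack-demo && cd pack-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "initial commit"
git branch feature

# ... and pack all loose references into a single file.
git pack-refs --all

ls .git/refs/heads      # empty: the loose reference files are gone
cat .git/packed-refs    # a sorted plain-text list, one ref per line
```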
14:52.320 --> 14:55.840
Now you know how the files backend works for storing references.

14:55.840 --> 14:58.760
Why does it need to change?

14:58.760 --> 15:02.760
The first problem is that file systems are simply weird.

15:02.760 --> 15:08.160
One special system dear to my heart is Windows, which disallows all kinds of file names.

15:08.160 --> 15:12.240
And as we encode reference names, via the file system path, it means that you cannot

15:12.240 --> 15:20.360
create a branch that is named CON, PRN, AUX, NUL, COM1 to 9, LPT1 to 9, and more.

15:20.360 --> 15:25.720
There are also many file systems out there, like NTFS, FAT or HFS+, that are case-insensitive

15:25.720 --> 15:26.720
by default.

15:26.960 --> 15:31.120
And again, the consequence is that you cannot create two branches that only differ in

15:31.120 --> 15:33.320
casing.

15:33.320 --> 15:37.760
macOS also does somewhat weird stuff, where it may change the way that your file name

15:37.760 --> 15:42.240
is represented and encoded in case it contains certain Unicode characters.

15:42.240 --> 15:46.760
So what you wanted to store on disk and what macOS actually decides to store on disk might

15:46.760 --> 15:48.760
be different.

15:48.760 --> 15:52.520
In the best case, you know that these restrictions apply and won't ever try to create

15:52.520 --> 15:53.520
such branches.

15:53.920 --> 16:01.200
In the worst case, you're stuck on Windows.

16:01.200 --> 16:05.920
If you want to write 20 references, you have to create 20 separate files.

16:05.920 --> 16:10.040
This does not only take long when you consider performance, but for typical file systems

16:10.040 --> 16:15.840
it also means that each of these references may require four kilobytes of storage.

16:15.840 --> 16:18.000
And that adds up rather quickly.

16:18.000 --> 16:23.160
Packing references is expensive, but mandatory when a repository has many branches

16:23.200 --> 16:25.240
to retain good performance.

16:25.240 --> 16:30.040
You have to rewrite the complete packed-refs file on every repack though, which is typically

16:30.040 --> 16:33.720
fine for repos that only have a handful of refs.

16:33.720 --> 16:36.480
But Git users are not always reasonable.

16:36.480 --> 16:41.640
One of the worst repositories that we, for example, host at GitLab contains around

16:41.640 --> 16:48.280
20 million references, which adds up to a packed-refs file that is two gigabytes in size.

16:48.280 --> 16:49.960
Which brings me to the next point.

16:49.960 --> 16:55.040
Deleting a reference also requires us to rewrite the complete packed-refs file.

16:55.040 --> 16:59.600
So every time someone deletes a reference in that repository, we have to rewrite

16:59.600 --> 17:01.720
two gigabytes of data.

17:01.720 --> 17:06.080
And to add insult to injury, this repository typically deletes references every couple

17:06.080 --> 17:11.960
of seconds. Not exactly efficient.

17:11.960 --> 17:13.480
Concurrency is an afterthought.

17:13.480 --> 17:17.680
There are multiple issues with representing references as single files when you have multiple

17:17.680 --> 17:22.160
readers and writers in your repository at the same point in time.

17:22.160 --> 17:27.000
One of the problems is that it is impossible to get a consistent view of all your references.

17:27.000 --> 17:32.200
To get that you would have to open multiple files and each of them could change concurrently.

17:32.200 --> 17:36.480
So when somebody writes to the repository while you read references, you cannot tell whether

17:36.480 --> 17:43.080
the result you got is consistent or whether it is a mixture of the old and the new state.

17:43.080 --> 17:46.840
Similarly, it is impossible to write more than one reference.

17:46.840 --> 17:51.120
Each reference you want to write is a separate file, and thus you cannot commit

17:51.120 --> 17:53.640
them all at once.

17:53.640 --> 17:56.560
There is also no way to lock the reference database.

17:56.560 --> 18:01.680
Well, there could be a central lock file, but introducing that retroactively would likely break

18:01.680 --> 18:04.400
all kinds of use cases out there.

18:04.400 --> 18:07.720
These problems have all been known for a very long time already and that is where the

18:07.720 --> 18:11.240
reftable backend comes into play.

18:11.240 --> 18:16.000
You can create a new repository with the reftable backend by passing --ref-format=reftable

18:16.000 --> 18:18.000
to git init.

18:18.000 --> 18:21.320
That is basically all you need to know.

18:21.320 --> 18:25.600
Afterwards, the repository is expected to behave exactly the same as with the files

18:25.600 --> 18:27.240
backend.

18:27.240 --> 18:31.680
But still, let's have a deeper look at how the repository is structured.

18:31.680 --> 18:36.360
Interestingly enough, we still see that we have the refs directory and the HEAD file, which

18:36.360 --> 18:38.880
were also present in the files backend.

18:38.880 --> 18:41.440
But these files are mere compatibility stubs.

18:41.440 --> 18:45.960
They don't contain any actual data, but they have to exist because Git does not consider

18:45.960 --> 18:53.000
a repository to be a repository unless those files exist.

18:53.000 --> 18:56.080
What's new is that we now also have a reftable directory.

18:56.080 --> 19:00.480
The first data structure that is specific to Reftables is the tables.list file.

19:00.480 --> 19:05.440
This file tracks the currently active list of tables in the repo.

19:05.440 --> 19:11.000
So whenever you update a reference, Git writes a new table and appends it to this list.

19:11.000 --> 19:15.080
This mechanism is really important because it allows for atomic updates.

19:15.080 --> 19:18.760
You can get a consistent snapshot by reading the tables.list file and then loading

19:18.760 --> 19:20.880
all of the referenced tables.

19:20.880 --> 19:25.440
Then you can perform an atomic write of multiple tables by writing a table, writing a

19:25.440 --> 19:32.360
temporary tables.list file, and then atomically renaming it into place.

19:32.360 --> 19:35.760
The tables themselves are stored in a binary format.

19:35.760 --> 19:39.400
While the binary format is more complex than a text-based format, it allows us to store

19:39.400 --> 19:41.520
data more efficiently.

19:41.520 --> 19:45.040
Also, as reference names are not encoded via the file system path anymore, you

19:45.040 --> 19:50.400
are not subject to file system limitations here.

19:50.400 --> 19:52.320
Reftables use a block-based structure.

19:52.320 --> 19:56.880
Every block is exactly 4 kilobytes of data, so that it fits exactly into a disk

19:56.880 --> 19:58.040
sector.

19:58.040 --> 20:01.280
This allows us to efficiently read a single block.

20:01.280 --> 20:03.280
Each block also has a specific type.

20:03.280 --> 20:07.480
Ref blocks, for example, store our references, but there are also other types that we

20:07.480 --> 20:10.560
will not go into detail today.

20:10.560 --> 20:14.080
Furthermore, every section of blocks may have an optional index.

20:14.080 --> 20:18.640
Each index entry stores the last record name that a given block contains, which allows us

20:18.640 --> 20:21.240
to quickly find a specific record.

20:21.240 --> 20:26.080
So let's say we want to look up the branch called D. We would then first read the

20:26.080 --> 20:32.040
index block. It tells us that the first block contains references up to refs/heads/C, and

20:32.040 --> 20:35.760
we know that this block will not contain the reference that we are searching for.

20:35.760 --> 20:41.360
Second, we see that the second block contains references up to refs/heads/G.

20:41.360 --> 20:45.000
So if the reference we are searching for exists, it must be in that block.

20:45.000 --> 20:50.720
So we read the target block and search for our reference there.

20:50.720 --> 20:54.640
Now if we zoom into one of those ref blocks, we can see that the blocks contain a lexicographically

20:54.640 --> 20:58.560
sorted list of refs with their respective object IDs.

20:58.560 --> 21:01.880
One important bit here is the greyed-out part of the ref names.

21:01.880 --> 21:05.280
These are prefixes that are common with the preceding reference.

21:05.280 --> 21:09.280
The reftable format uses prefix compression to save a little bit of space.

21:09.280 --> 21:13.120
Instead of storing the full ref name, we only state how many bytes to reuse from the

21:13.120 --> 21:19.560
preceding reference, and then only store the differing part.

21:19.560 --> 21:23.280
A major difference is also how we pack references.

21:23.280 --> 21:28.160
With the files backend, we write lots of loose references all the time and eventually

21:28.160 --> 21:30.840
do an all-into-one repack.

21:30.840 --> 21:34.000
With the reftable backend, things work a little bit differently.

21:34.000 --> 21:38.040
Every time we write references, we append a new table to the stack of tables.

21:38.040 --> 21:42.000
Afterwards, we verify whether the tables form a geometric sequence.

21:42.000 --> 21:45.880
The next table must be at most half the size of the current one.

21:45.880 --> 21:49.320
This check happens every single time we update the stack.

21:49.320 --> 21:54.440
So we basically ensure that the stack is well optimized as we go.

21:54.440 --> 21:58.960
To demonstrate, we start out with two tables, one that contains eight references and one that

21:58.960 --> 22:01.760
only contains a single ref.

22:01.760 --> 22:04.800
So let's write a new table with a single reference.

22:04.800 --> 22:09.960
We see the geometric sequence property is not maintained anymore, as one is not smaller

22:09.960 --> 22:14.440
than or equal to half of one, so we merge them together.

22:14.440 --> 22:19.360
There is no need to compact further though, as two is less than half of eight, and

22:19.360 --> 22:22.640
so the remaining tables form a geometric sequence.

22:22.640 --> 22:24.600
We create another table with a single ref.

22:24.600 --> 22:29.160
We see the geometric sequence is maintained, no need to merge.

22:29.160 --> 22:33.200
If we now create another table, then we will have to merge one time, but the geometric

22:33.200 --> 22:38.040
sequence is still not maintained, so we have to merge a second time.

22:38.040 --> 22:42.200
The result is that we always have at most logarithmically many tables, which ensures

22:42.200 --> 22:46.680
that reads continue to be fast.

22:46.680 --> 22:51.200
We now have a very rough understanding of how ref tables work, but why do we do this whole

22:51.200 --> 22:55.760
exercise to swap out the storage layer in the first place?

22:55.760 --> 23:00.960
With the files backend, we're subject to a lot of specifics of the file system, as reference

23:00.960 --> 23:03.760
names are derived from path names.

23:03.760 --> 23:08.160
With reftables, all of these issues go away, as the ref names are encoded in the individual

23:08.160 --> 23:11.160
tables.

23:11.160 --> 23:14.600
Also, while I very much hope that you don't have to work in repos that contain millions

23:14.600 --> 23:19.280
of references, such repos exist out there, and if you work in them, then you might

23:19.280 --> 23:24.320
very well appreciate the improved performance.

23:24.320 --> 23:28.080
Also, the files backend does not allow for atomic updates, as references are written

23:28.080 --> 23:30.640
one by one, as separate files.

23:30.640 --> 23:34.800
This issue goes away with the reftable backend, where reads are consistent and writes

23:34.800 --> 23:35.800
are atomic.

23:35.800 --> 23:40.240
You probably don't care too much about this on the client side, but on the server side,

23:40.240 --> 23:43.440
this is a huge improvement.

23:43.440 --> 23:47.440
And last but not least, reftables will also become the default in Git 3.0.

23:47.440 --> 23:51.560
So if you, for example, use Git in scripts, or if you use it on the server side, then you

23:51.560 --> 23:54.920
should verify that you don't play weird games by accessing references directly

23:54.920 --> 23:56.640
via the file system.

23:56.640 --> 24:01.120
You should always access references via Git commands, and if you do so, then you shouldn't

24:01.120 --> 24:03.440
observe any differences.

24:03.440 --> 24:11.840
So, we talked a little bit about how we are improving the storage of references in Git.

24:11.840 --> 24:16.160
But this addresses somewhat theoretical issues; I would claim that most of you in this room

24:16.160 --> 24:20.880
probably don't encounter those problems in practice.

24:20.880 --> 24:24.640
When it comes to scalability bottlenecks, the more important problem tends to be large

24:24.640 --> 24:25.640
files.

24:25.640 --> 24:30.680
Storing large binary files in Git is unfortunately not a use case that is well supported

24:30.680 --> 24:31.680
nowadays.

24:31.680 --> 24:36.600
As a workaround, developers tend to resort to third-party solutions like, for example, Git

24:36.600 --> 24:38.160
LFS.

24:38.160 --> 24:41.760
This is something that we want to change.

24:41.760 --> 24:47.440
But first, why are large files a hard problem for Git in the first place?

24:47.440 --> 24:49.720
There's two important issues here.

24:49.720 --> 24:53.000
The first problem is how Git compresses objects.

24:53.000 --> 24:59.240
Git works extremely well for repositories with text files, like for example source code.

24:59.240 --> 25:03.880
First, it uses zlib compression to reduce the general size of objects.

25:03.880 --> 25:09.160
And second, Git knows to store incremental changes to objects as deltas.

25:09.160 --> 25:13.040
Together, this achieves great compression ratios for text files.

25:13.040 --> 25:16.840
After all, this is what Git was designed for.

25:16.840 --> 25:21.200
But unfortunately, zlib compression tends to not work well for binary files.

25:21.200 --> 25:26.440
And computing deltas becomes increasingly expensive the larger your files get.
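To make this concrete, here is a small sketch (plain zlib, not Git code) showing how well repetitive text compresses compared to random bytes, which stand in for already-compressed binary data like images or video:

```python
import random
import zlib

# Repetitive text, like source code, is zlib's best case.
text = b"def handle_request(request):\n    return request.body\n" * 200

# Random bytes stand in for already-compressed binary data, which
# zlib can barely shrink at all.
rng = random.Random(42)
binary = bytes(rng.randrange(256) for _ in range(len(text)))

text_ratio = len(zlib.compress(text)) / len(text)
binary_ratio = len(zlib.compress(binary)) / len(binary)

print(f"text:   compressed to {text_ratio:.0%} of original size")
print(f"binary: compressed to {binary_ratio:.0%} of original size")
```

On top of that, Git's delta search has to compare object contents against candidate bases, and that search is what becomes prohibitively expensive as files grow.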

25:26.440 --> 25:31.840
The consequence is that even small edits to such files end up creating entirely new objects

25:31.840 --> 25:36.120
without using any deltas at all.

25:36.120 --> 25:39.040
The second problem occurs on the networking layer.

25:39.040 --> 25:44.840
Whenever you clone a Git repository, you get a full copy of all of the history by default.

25:44.840 --> 25:47.160
This is what you want for normal repos.

25:47.160 --> 25:51.360
But once we're talking about large monorepos, with binary files in them, then you probably

25:51.360 --> 25:56.000
don't want to download hundreds of gigabytes of data.

25:56.000 --> 26:00.240
This is further stressed by the fact that there is no support for resumable clones.

26:00.240 --> 26:05.640
So if you have downloaded 400 gigabytes out of a 500 gigabyte repository and your network

26:05.640 --> 26:10.040
disconnects, then you will have to re-download everything.

26:10.040 --> 26:13.920
And because deltification does not work for large binary files, you have to re-download

26:13.920 --> 26:18.160
the full blob contents every single time a large binary file changes instead of only

26:18.160 --> 26:21.600
fetching the incremental changes.

26:21.600 --> 26:27.080
The result is that many teams working with large files simply avoid using Git altogether,

26:27.080 --> 26:30.560
which is unfortunate.

26:30.560 --> 26:34.480
Of course, large monorepos don't only cause issues on the client side.

26:34.480 --> 26:37.200
Code forges are also struggling with them.

26:37.200 --> 26:41.880
First and foremost, forges don't have the luxury of partial clones, for example.

26:41.880 --> 26:45.520
A forge needs to have all objects available, as it would otherwise not be able to serve those

26:45.520 --> 26:47.320
to the client.

26:47.320 --> 26:50.040
The consequence is a significant storage cost.

26:50.040 --> 26:55.800
Our analysis on GitLab.com has shown that 75% of our storage space for Git repositories

26:55.800 --> 27:00.840
can be attributed to binary files larger than one megabyte.

27:00.840 --> 27:05.640
The huge repo sizes also cause repository maintenance to become very expensive.

27:05.640 --> 27:10.440
We have to rewrite objects every once in a while, for example, to delete some of them.

27:10.440 --> 27:14.960
If your repository contains large binaries, then this task becomes computationally very

27:14.960 --> 27:16.960
expensive.

27:16.960 --> 27:22.480
Also, it is not possible to offload any of those objects to a content delivery network.

27:22.480 --> 27:26.200
All data needs to be served by the Git server, which makes it a significant

27:26.200 --> 27:28.200
bottleneck.

27:28.200 --> 27:33.080
So in summary, large objects are a significant cost factor for any large Git provider

27:33.080 --> 27:35.800
out there.

27:35.800 --> 27:40.120
Git users have adapted to work around those shortcomings with band-aids.

27:40.120 --> 27:42.760
Git LFS is one such solution.

27:42.760 --> 27:46.960
Instead of storing actual file contents in Git, you end up storing only a pointer to the

27:46.960 --> 27:48.480
object contents.

27:48.480 --> 27:52.080
The actual content is then stored on a separate server that is better suited for storing

27:52.080 --> 27:54.920
binary data.
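The checked-in pointer file is just a few lines of text. Roughly, a Git LFS pointer looks like this (the oid and size here are made-up example values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
size 104857600
```

Git only ever sees this tiny text blob; the LFS client swaps the real content in and out during checkout and commit.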

27:54.920 --> 27:59.400
This solution keeps the repository small, and is well supported by hosting providers.

27:59.400 --> 28:01.160
But it's not part of Git.

28:01.160 --> 28:05.000
It doesn't know to transfer deltas, and once you have accidentally committed any large

28:05.000 --> 28:10.080
blob into your history, then it's stuck there forever.

28:10.080 --> 28:14.440
Partial clones allow you to clone a repository while filtering out certain objects, like

28:14.440 --> 28:16.280
for example, blobs.

28:16.280 --> 28:20.600
This mechanism is a native part of Git, and transparent to the user, as Git will automatically

28:20.600 --> 28:24.160
fetch those missing objects on demand.

28:24.160 --> 28:26.800
But users have to know how to use it.

28:26.800 --> 28:30.960
There is no automatic pruning of large old files that have not been accessed for a long

28:30.960 --> 28:36.520
time, and servers still cannot offload the traffic.

28:36.520 --> 28:40.960
Partial clones have existed for quite a while, but I bet many of you have never

28:40.960 --> 28:42.840
used them before.

28:42.840 --> 28:44.480
Overall, it's quite simple.

28:44.480 --> 28:49.840
You simply specify an object filter, and from there on Git will fetch those filtered

28:49.840 --> 28:52.600
blobs whenever required.

28:52.600 --> 28:57.960
If we, for example, clone with --filter=blob:none, then Git will first fetch all

28:57.960 --> 29:02.320
the non-blob objects, which basically leaves us with the repository shape only, but we

29:02.320 --> 29:05.080
don't have any contents at all yet.

29:05.080 --> 29:09.480
Then the checkout begins, and Git realizes that it actually needs to fetch some file

29:09.480 --> 29:10.480
contents.

29:10.480 --> 29:18.040
So it does a batched fetch for only the blobs needed to satisfy checking out the HEAD commit.

29:18.040 --> 29:22.240
If we eventually need other blobs, that we don't have yet, then Git knows to fetch those

29:22.240 --> 29:24.760
on demand.

29:24.760 --> 29:29.000
For this specific repository, the result is that we only have to download 1.2

29:29.000 --> 29:32.760
gigabytes of data instead of 2.8.

29:32.760 --> 29:37.320
As mentioned, the problem is that Git users need to know how to use them, and on the

29:37.320 --> 29:41.760
server side, the traffic still cannot be offloaded to secondary servers.

29:41.760 --> 29:46.240
That's the problem that large object promisors aim to solve.

29:46.240 --> 29:47.960
The idea is quite simple.

29:47.960 --> 29:52.720
When a client clones, the server announces a set of promisor remotes to the client.

29:52.720 --> 29:55.920
Each promisor remote has an object filter attached to it.

29:55.920 --> 30:00.000
The client then automatically selects a subset of those promisors and uses the attached

30:00.000 --> 30:02.640
filters to perform the initial clone.

30:02.640 --> 30:06.920
The used promisors get stored in the configuration, and are from there on used whenever the

30:06.920 --> 30:09.600
client needs to backfill some data.

30:09.600 --> 30:12.520
This tries to solve multiple problems.

30:12.520 --> 30:17.720
First, we can now offload traffic to a secondary Git server, and this reduces load on the

30:17.720 --> 30:19.720
primary.

30:19.720 --> 30:22.600
Also, the functionality is built right into Git.

30:22.600 --> 30:26.000
There is no need for external tools anymore.

30:26.000 --> 30:29.000
This solution is also fully transparent to the client.

30:29.000 --> 30:34.840
The server can announce an optimal filter, and the client can automatically use it.

30:34.840 --> 30:36.960
But there is one key feature here though.

30:36.960 --> 30:42.100
Git doesn't only support HTTPS or SSH clones, but in theory it can also support fetching

30:42.100 --> 30:44.560
via arbitrary transports.

30:44.560 --> 30:48.560
This support is made extensible via so-called remote helpers.

30:48.560 --> 30:52.520
A remote helper is a binary that is simply called git-remote-something.

30:52.520 --> 30:57.760
So when Git, for example, sees the s3 protocol, it knows to look for a binary called

30:57.760 --> 31:04.000
git-remote-s3, and if it exists, it uses that one to talk to the remote.
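The naming convention itself is simple enough to sketch. This only shows the lookup-name part, not the actual helper protocol (real Git also supports the explicit `<helper>::<address>` syntax and transport capabilities):

```python
from urllib.parse import urlsplit

def remote_helper_name(url: str) -> str:
    """Return the binary name Git would search for on PATH to
    handle a URL with an otherwise unknown transport scheme."""
    scheme = urlsplit(url).scheme
    return f"git-remote-{scheme}"

print(remote_helper_name("s3://my-bucket/large-objects"))  # git-remote-s3
print(remote_helper_name("ipfs://some/repo"))              # git-remote-ipfs
```

This is the same mechanism that, for example, git-remote-http uses under the hood.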

31:04.000 --> 31:07.320
The key realization now is that the announced promisors are just regular remotes.

31:07.320 --> 31:13.040
One may, for example, use a protocol that stores large objects in an S3-compatible store.

31:13.040 --> 31:17.840
This allows us to offload objects to a content delivery network, and it allows us to store

31:17.840 --> 31:24.800
large blobs in a format that is much better suited for them.

31:24.800 --> 31:29.040
With large object promisors, we will have the infrastructure in place to let servers offload

31:29.040 --> 31:33.960
binary files, and clients will know to automatically use them if desired.

31:33.960 --> 31:36.320
But we still have another issue.

31:36.320 --> 31:40.800
Even with promisors, Git's object format still doesn't handle binary files efficiently on

31:40.800 --> 31:42.560
the client side.

31:42.560 --> 31:46.320
This is where pluggable object databases come into play, which will allow us to introduce

31:46.320 --> 31:51.720
a new storage format for large binary files specifically.

31:51.720 --> 31:56.840
As mentioned before, Git's pack file format uses delta compression to store incremental changes

31:56.840 --> 31:59.040
to objects efficiently.

31:59.040 --> 32:04.000
This works amazingly for text files, but for large binaries, computing deltas this way

32:04.000 --> 32:07.800
is way too expensive, so Git doesn't even try.

32:07.800 --> 32:12.480
Instead, even small edits create entirely new objects once the objects reach a certain

32:12.480 --> 32:14.200
size.

32:14.200 --> 32:19.120
We need a format designed for binaries, where incremental changes to a binary file only lead

32:19.120 --> 32:21.480
to a small storage increase.

32:21.520 --> 32:25.200
This new storage format also needs to be efficient for any file size.

32:25.200 --> 32:31.120
The computational complexity should at most grow linearly with the file size.

32:31.120 --> 32:34.480
And last but not least, the format also needs to be compatible with the existing format

32:34.480 --> 32:39.360
somehow, so that you can mix and match the old storage format for text files and the new

32:39.360 --> 32:43.400
storage format for large binaries.

32:43.400 --> 32:47.560
The storage format is deeply baked into Git, but alternative implementations like

32:47.640 --> 32:52.240
libgit2, go-git and JGit already have pluggable backends.

32:52.240 --> 32:56.280
So there is no fundamental reason why Git can't do this too.

32:56.280 --> 33:02.120
It requires a lot of plumbing and refactoring, but it's certainly a feasible thing.

33:02.120 --> 33:06.360
Assuming that we had pluggable object databases and that we could swap out the backend, the

33:06.360 --> 33:10.200
idea would be to introduce chunking into Git.

33:10.200 --> 33:14.080
With our current deltification logic, we have to do expensive calculations to find

33:14.080 --> 33:18.040
ideal deltas, which is simply too costly for binaries.

33:18.040 --> 33:23.400
With chunking, though, we can deduplicate common parts by cutting a large binary file into

33:23.400 --> 33:29.360
smaller chunks, and each of these chunks can then be deduplicated individually.

33:29.360 --> 33:33.800
There's two significantly different ways of doing chunking.

33:33.800 --> 33:38.120
The first and obvious way is to simply split a file into fixed size chunks.

33:38.120 --> 33:44.120
In this example, we could for example cut the file after every fourth character.

33:44.120 --> 33:47.960
The problem, though, is that if you insert new data at any point in the file, then all

33:47.960 --> 33:51.160
the chunks that follow afterwards will now change.

33:51.160 --> 33:55.480
The result is that we cannot deduplicate those chunks.
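A tiny sketch of that failure mode: insert two bytes at the front of a file, and with fixed-size chunks nothing lines up anymore:

```python
def fixed_chunks(data: bytes, size: int = 4) -> list[bytes]:
    """Split data into fixed-size chunks, like cutting after every 4th byte."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"The quick brown fox jumps over the lazy dog"
edited = b"A " + original  # insert two bytes at the very beginning

# Every boundary after the insertion point has shifted, so no chunk
# of the edited file matches a chunk of the original.
shared = set(fixed_chunks(original)) & set(fixed_chunks(edited))
print(f"chunks shared after the edit: {len(shared)}")
```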

33:55.480 --> 33:57.960
The alternative is content-defined chunking.

33:57.960 --> 34:02.680
The key insight of content-defined chunking is that boundaries are determined not by length,

34:02.680 --> 34:05.280
but by the content itself.

34:05.280 --> 34:09.200
Every time a specific property is triggered, we cut a new chunk.

34:09.200 --> 34:16.360
The result is that the file will be cut into chunks of variable length.

34:16.360 --> 34:21.120
So if you insert data at the beginning or anywhere in the file now, then the first chunk

34:21.120 --> 34:23.040
will of course change.

34:23.040 --> 34:27.000
But because the boundary is defined by the content, we know that we are still going to cut

34:27.000 --> 34:34.480
subsequent chunks at the exact same boundaries, and the remaining chunks will remain identical.

34:34.480 --> 34:39.200
The mechanism used for this is to compute a rolling hash function over a sliding window.

34:39.200 --> 34:44.520
When the hash matches a condition like for example, being divisible by n, then we cut.

34:44.520 --> 34:47.600
This is how tools like, for example, restic or borg,

34:47.600 --> 34:53.480
and also rclone, can handle large file backups efficiently.
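As a rough sketch (a toy chunker, not the actual algorithm any of those tools use), the whole idea fits in a few lines: hash a small window at every position and cut whenever the hash is divisible by some n. Because each boundary depends only on the few bytes before it, chunking resynchronizes shortly after an insertion, and everything behind it deduplicates again:

```python
import random

def chunk(data: bytes, window: int = 8, divisor: int = 16) -> list[bytes]:
    """Content-defined chunking: cut after position i whenever a hash of
    the last `window` bytes is divisible by `divisor`. Boundaries depend
    only on local content, never on absolute file offsets."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        h = 0
        for b in data[i - window + 1 : i + 1]:  # recomputed per position for clarity
            h = (h * 31 + b) & 0xFFFFFFFF
        if h % divisor == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

rng = random.Random(1)
original = bytes(rng.randrange(256) for _ in range(2000))
edited = b"new header" + original  # insert 10 bytes at the very beginning

# Only the chunks touching the insertion differ; the rest deduplicate.
shared = set(chunk(original)) & set(chunk(edited))
print(f"{len(shared)} of {len(chunk(original))} original chunks survive the edit")
```

A real implementation would use a proper rolling hash so the window hash is updated in constant time per byte instead of being recomputed from scratch.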

34:53.480 --> 34:56.640
We don't really need to replace Git's entire storage format.

34:56.640 --> 35:01.760
It works quite well for text files, and content-defined chunking would likely make compression

35:01.760 --> 35:04.440
ratios worse for them.

35:04.440 --> 35:09.880
Git already supports multiple object sources attached to a repository, so that you can use the

35:09.880 --> 35:13.320
alternatives mechanism to have two different storage types.

35:13.320 --> 35:18.760
The idea is to connect two object sources, and based on whether or not a file is

35:18.760 --> 35:24.280
a binary file, you would either store it in the chunked format, or you would store it using

35:24.280 --> 35:28.520
pack files.

35:28.520 --> 35:31.960
The two efforts to introduce large object promisors and pluggable object databases are

35:31.960 --> 35:33.960
progressing in parallel.

35:33.960 --> 35:39.000
The initial protocol implementation for large object promisors has landed in Git 2.50,

35:39.000 --> 35:42.040
and has been extended in Git 2.52.

35:42.040 --> 35:47.200
The next steps are to automatically use filters and promisors on the client side.

35:47.200 --> 35:50.440
Overall, this is quite close to being usable in production.

35:50.440 --> 35:55.360
I would assume that over the next couple of releases, we will have all the required parts.

35:55.360 --> 36:00.480
What is of course still missing is support in Git forges.

36:00.480 --> 36:04.440
The effort around pluggable object databases is not that far yet.

36:04.440 --> 36:07.840
Over the last couple of Git releases, we have spent some significant time refactoring the

36:07.840 --> 36:11.560
code base and how Git accesses objects.

36:11.560 --> 36:16.680
Starting with Git 2.53, which will be released tomorrow, no, in two days actually,

36:16.680 --> 36:21.240
we will have a unified object database interface that makes it easy for us to change

36:21.240 --> 36:23.600
the format going forward.

36:23.600 --> 36:28.240
In Git 2.54, I then expect that we will have an initial proof of concept, but implementing

36:28.240 --> 36:32.920
the chunked format will probably take a little bit longer.

36:32.920 --> 36:37.040
Once those parts have landed though, Git will become a lot more viable for large binary

36:37.040 --> 36:42.840
files without workarounds.

36:42.840 --> 36:46.040
The last couple of sections have been about technical details.

36:46.040 --> 36:50.920
One core area though that Git gets a lot of complaints about is its UI.

36:50.920 --> 36:55.920
Many commands are extremely confusing, and some workflows are significantly harder than

36:55.920 --> 36:58.200
they have any right to be.

36:58.200 --> 37:02.680
And recently there's been a competitor that makes us have a hard look at ourselves and

37:02.680 --> 37:05.320
what we're doing.

37:05.320 --> 37:10.400
Jujutsu is a modern version control system that's fully compatible with Git repositories.

37:10.400 --> 37:14.600
It was started a couple of years ago by Martin von Zweigbergk at Google, but by now it has

37:14.600 --> 37:17.320
a growing open source community.

37:17.320 --> 37:21.840
You can use it in existing Git repositories, push to large

37:21.840 --> 37:26.240
forges like GitLab and GitHub, and your collaborators won't even notice that you're

37:26.240 --> 37:28.800
using JJ.

37:28.800 --> 37:33.520
Everyone knows that Git's user experience is not exactly the most loved one, and indeed,

37:33.520 --> 37:38.440
many people seem to like JJ's experience way more.

37:38.440 --> 37:40.880
It's of course not much of a surprise.

37:40.880 --> 37:45.720
The Git user interface has grown somewhat organically over the last two decades, which

37:45.720 --> 37:50.440
leads to inconsistencies and commands that simply don't feel modern.

37:50.440 --> 37:54.200
JJ started from scratch, and it took all of the lessons that Git learned the hard

37:54.200 --> 37:58.040
way directly to heart.

37:58.040 --> 38:03.440
As a Git developer, I was naturally quite curious, so I had a look at JJ quite early.

38:03.440 --> 38:09.680
I looked at it, found it confusing, and called it stupid.

38:09.680 --> 38:14.520
It just didn't make any sense to me at all, so I simply discarded it.

38:14.520 --> 38:18.560
But there was a steady influx of people who had seen the light, so to say.

38:18.560 --> 38:24.200
So I decided to eventually have another look, and that's when it finally clicked.

38:24.200 --> 38:28.880
That moment when you realize that a tool simply fixes all the UI issues in the tool

38:28.880 --> 38:33.520
that you have been developing for the last 20 years was not exactly great.

38:33.520 --> 38:38.840
But I had two options, either I can despair or I can learn from the competition, and

38:38.840 --> 38:42.120
I chose to learn from it.

38:42.120 --> 38:45.960
There's a couple of significant departures from what Git does.

38:45.960 --> 38:52.280
First, history is malleable by default, and you can basically shape your commits as you go.

38:52.280 --> 38:56.520
It's almost as if you were permanently in an interactive rebase mode, but without all of the

38:56.520 --> 38:59.480
confusing parts.

38:59.480 --> 39:04.600
Also when you rewrite history, dependents update automatically, so if you edit a commit,

39:04.600 --> 39:08.600
all children are rebased automatically.

39:08.600 --> 39:10.920
There is no special detached HEAD mode.

39:10.920 --> 39:16.200
In fact, you often don't even have local named branches, so you're constantly working

39:16.200 --> 39:18.960
with detached HEADs, so to say.

39:18.960 --> 39:22.480
And also, conflicts are data, not emergencies.

39:22.480 --> 39:27.120
You can commit them, and resolve them at any later point in time.

39:27.120 --> 39:28.920
These aren't just nice-to-haves.

39:28.920 --> 39:32.560
They fundamentally change how you think about your commits.

39:32.560 --> 39:36.680
You stop treating them as precious artifacts, and rather start treating them as drafts

39:36.680 --> 39:39.360
that you can freely edit.

39:39.360 --> 39:45.160
As said in the intro, Git is old, so we cannot just completely revamp our UI and thus

39:45.160 --> 39:47.160
break all the workflows out there.

39:47.160 --> 39:53.000
But there are some things that we can definitely steal from JJ.

39:53.000 --> 39:57.960
The primary way to rewrite history is by using git rebase, and specifically interactive

39:57.960 --> 39:59.760
rebases.

39:59.760 --> 40:03.560
But interactive rebases make some tasks a lot harder than they have any right

40:03.560 --> 40:05.560
to be.

40:05.560 --> 40:07.840
One example is splitting up a commit.

40:07.840 --> 40:12.160
First, you need to figure out which commits you want to split up, and let's pretend

40:12.160 --> 40:16.960
we want to, for example, split up the commit "Introduce A and B".

40:16.960 --> 40:20.160
You would now start an interactive rebase.

40:20.160 --> 40:22.440
This already causes the first confusing moment.

40:22.440 --> 40:28.920
In order to edit that commit, you have to rebase on top of its parent, not the commit itself.

40:28.920 --> 40:32.960
You're now presented with an instruction sheet in your editor, where you have to manually

40:32.960 --> 40:38.040
search for your commit, change the instruction from pick to edit, and then save.

40:38.040 --> 40:42.280
You get used to it, but it's somewhat weird.

40:42.280 --> 40:45.760
You're now put on top of the commit that you want to edit.

40:45.760 --> 40:51.160
You have to undo it, because you want to split it up and create two new commits.

40:51.160 --> 40:54.920
You carefully stage the first file that you want to put into the first commit, and commit

40:54.920 --> 40:55.920
it.

40:55.920 --> 41:00.320
The original commit message is kind of gone at this point in time, except if you know that

41:00.320 --> 41:08.040
you can pass --reedit-message=HEAD@{1}.

41:08.040 --> 41:11.040
Not exactly ergonomic either.

41:11.040 --> 41:15.160
We can edit another file now, and then we create the second commit.

41:15.160 --> 41:20.440
And finally, you conclude the action by saying git rebase --continue.

41:20.440 --> 41:25.800
All of this requires 7 commands, editing an arcane instruction sheet, and some scary operations

41:25.800 --> 41:28.600
like discarding commits.

41:28.600 --> 41:32.480
And if you had other branches depending on this commit, well, they're now pointing at the

41:32.480 --> 41:33.920
old object IDs.

41:33.920 --> 41:39.160
As you can see, the old "Introduce A and B" commit still exists in the history,

41:39.160 --> 41:43.920
and is referenced by both of the other branches, feature A and feature B.

41:43.920 --> 41:48.560
In Jujutsu, all you have to say is jj split, and then it asks you which changes should

41:48.560 --> 41:52.960
be part of what commit, and what the commit messages should

41:52.960 --> 41:53.960
be.

41:53.960 --> 41:59.920
I kind of get why people actually prefer this workflow.

41:59.920 --> 42:04.120
As mentioned, dependent branches would have to be rebased manually after an interactive

42:04.120 --> 42:05.320
rebase.

42:05.320 --> 42:09.160
This is becoming a problem though, if you want to work with stacked branches.

42:09.160 --> 42:14.040
A style of working that is becoming increasingly popular.

42:14.040 --> 42:19.920
Let's assume you want to build a new feature that consists of a couple of logical steps.

42:19.920 --> 42:24.360
The traditional workflow typically creates a single branch that contains each of these steps

42:24.360 --> 42:26.480
as individual commits.

42:26.480 --> 42:29.320
The end result is one big merge request.

42:29.320 --> 42:33.640
This is easy for the developer, but painful for the reviewer, because they now have to

42:33.640 --> 42:38.520
read through hundreds of lines of changes.

42:38.520 --> 42:42.880
The alternative that gains more and more traction though is to have stacked branches.

42:42.880 --> 42:47.960
Instead of putting every commit into the same branch, you create a set of dependent branches.

42:47.960 --> 42:52.960
Each of these branches builds one small part of the bigger feature, and each of them uses

42:52.960 --> 42:55.640
a separate merge request.

42:55.640 --> 43:01.080
This overall of course requires more steps, but the review will now go a lot faster, because

43:01.080 --> 43:07.400
the changes that the other person needs to review are much smaller overall.

43:07.400 --> 43:11.480
The problem is that it makes maintaining these stacked branches quite painful.

43:11.480 --> 43:15.360
Let's say you, for example, rework feat-auth to address review feedback.

43:15.360 --> 43:19.960
You simply fix a typo in the commit message, and then suddenly all the dependent branches

43:19.960 --> 43:24.120
will be orphaned now as they point to the old version of feat-auth.

43:24.120 --> 43:28.400
You have to manually rebase feat-api and feat-ui.

43:28.400 --> 43:36.160
Do this a few times a day, and you will quickly abandon the stacked branch workflow entirely.

43:36.160 --> 43:40.120
We aim to solve these issues, and make stacked branch workflows easier with a new Git command,

43:40.120 --> 43:42.120
git history.

43:42.120 --> 43:45.440
The goal is to have a couple of opinionated subcommands.

43:45.440 --> 43:48.800
These subcommands do one thing, and they do it well.

43:48.800 --> 43:53.520
git history reword, for example, rewords the commit message of one of your commits.

43:53.520 --> 43:59.520
It's just like git commit --amend, except for an arbitrary commit in your history.

43:59.520 --> 44:00.520
git history split

44:00.520 --> 44:02.520
will work just like jj split.

44:02.520 --> 44:07.880
You give it a commit, Git asks which parts should be part of what commit, and then you type

44:07.880 --> 44:11.280
in two commit messages, nice and easy.

44:11.280 --> 44:15.640
git history absorb takes all of your staged changes, and figures out automatically which

44:15.640 --> 44:20.080
commits to apply them to, and squashes them into those commits.

44:20.080 --> 44:24.080
All of these commands are heavily inspired by what JJ provides.

44:24.080 --> 44:28.200
In fact, JJ itself was also inspired by other version control systems.

44:28.200 --> 44:32.600
git history absorb, for example, and jj absorb have originally been implemented

44:32.600 --> 44:36.400
in Mercurial a long time ago already.

44:36.400 --> 44:41.040
This is, of course, only a start, and we plan to add more subcommands to it that make editing

44:41.040 --> 44:44.240
your commit history way easier going forward.

44:44.240 --> 44:49.720
Some examples include squashing a range of commits, dropping a specific commit, reordering

44:49.720 --> 44:52.400
them, and so on.

44:52.400 --> 44:57.320
The more important part, though, is that we don't only aim to make recurring tasks easier.

44:57.320 --> 45:02.560
This command also knows to automatically rebase dependent branches.

45:02.560 --> 45:07.120
So let's revisit our example from earlier on by using git history instead.

45:07.120 --> 45:11.640
We again have the same state as before with git rebase.

45:11.640 --> 45:16.080
To split up the commit, we simply execute git history split with the commit ID.

45:16.080 --> 45:19.960
It now goes through all the changes one by one, and asks, should that be part of the

45:19.960 --> 45:21.960
first commit or not?

45:21.960 --> 45:28.040
So we answered the first question with yes, and the second question with no, which means

45:28.040 --> 45:32.760
that file a will be part of commit one, and file b will be part of commit two.

45:32.760 --> 45:37.040
If you confirm, Git will now ask you for two commit messages.

45:37.040 --> 45:41.760
In both cases, it also knows to retain the original commit message so that you have it

45:41.760 --> 45:43.760
as context.

45:43.760 --> 45:48.240
But if we do another log, you can see that the commit has been split up into two commits,

45:48.240 --> 45:52.880
but more importantly, you can also see that all of the dependent branches, main feature

45:52.880 --> 46:00.240
A and feature B, have been updated automatically to point to the newly rewritten commits.

46:00.240 --> 46:03.800
git history has undergone a very long discussion on the Git mailing list.

46:03.800 --> 46:08.680
The first version was posted shortly after Git 2.50 was released, back in August last year.

46:08.680 --> 46:13.920
Since then, it has been significantly reworked, with a lot of bikeshedding, and it has learned to

46:13.920 --> 46:17.680
handle stacked branch workflows way better.

46:17.680 --> 46:22.120
The initial architecture for git history has been merged upstream, and will likely be part of

46:22.120 --> 46:25.600
the next Git release, in about three months probably.

46:25.600 --> 46:31.200
For now, it only supports rewording, but splitting up a commit will move into review next.

46:31.200 --> 46:35.760
More subcommands will follow, and in that context we will have a look at first class conflicts

46:35.760 --> 46:37.720
as well.

46:37.720 --> 46:42.360
These are a central part of how JJ works with stacked branches.

46:42.360 --> 46:46.760
I'm also very certain that the Git project will have a deeper look at what else JJ has to

46:46.760 --> 46:48.080
offer.

46:48.080 --> 46:52.240
Not all of these changes will land in Git, because the design is simply different,

46:52.240 --> 46:56.400
and as a consequence not everything that makes sense in the JJ world also makes sense in

46:56.400 --> 46:58.080
the Git world.

46:58.080 --> 47:02.640
But I'm very certain that there will be more to come.

47:02.640 --> 47:06.640
So this has been a little bit of a whirlwind tour through what's happening and

47:06.680 --> 47:08.800
what's cooking in Git right now.

47:08.800 --> 47:12.960
I hope you have learned a bunch of new things, and have a little bit of a clear picture

47:12.960 --> 47:16.320
of where the Git project is headed.

47:16.320 --> 47:20.760
We have been looking at the SHA-256 transition, the reftable backend to store

47:20.760 --> 47:25.600
references, upcoming changes to improve large object support, and some of the upcoming

47:25.600 --> 47:27.600
history rewriting features.

47:27.600 --> 47:32.240
If you have any feedback or questions, I'm very happy to discuss them, so please just

47:32.280 --> 47:34.880
approach me in the hallway and have a chat with me.

47:34.880 --> 47:35.880
Thanks a lot for your attention.

