WEBVTT

00:00.000 --> 00:12.400
So, I've always been working with databases, a lot of traditional databases, like DB2 or

00:12.400 --> 00:16.000
Oracle or Postgres, using traditional B-trees.

00:16.000 --> 00:22.760
I've also been a developer advocate for YugabyteDB, which uses RocksDB and LSM trees.

00:22.760 --> 00:28.080
And now I'm a developer advocate for MongoDB, which uses WiredTiger for the storage, which

00:28.080 --> 00:31.920
has a B-tree, but a modern version of a B-tree.

00:31.920 --> 00:39.440
So, the goal is to compare not only the main differences, but the different optimizations

00:39.440 --> 00:44.480
that were made on them, because we could go fast and have only one slide and say,

00:44.480 --> 00:48.920
LSM trees are better for writes, B-trees are better for reads.

00:48.920 --> 00:55.560
First, it's not completely true, both have optimizations for the two workloads, but also there is no

00:55.560 --> 01:01.480
database where you only write, and there is no database where you only read; usually

01:01.480 --> 01:07.360
you put something in the database with the goal of reading it later.

01:07.360 --> 01:15.960
And even if you write, if you have a write-heavy workload on your database.

01:15.960 --> 01:21.680
When you write, you may read a lot also.

01:21.680 --> 01:28.880
For example, when you insert into a SQL database, the database must check

01:28.880 --> 01:34.800
if the key already exists, to raise an error if it does: it has to read for that.

01:34.800 --> 01:39.560
Even if you upsert, you don't have to read, but if you have secondary indexes, you cannot

01:39.560 --> 01:42.960
just upsert them, you have to read something.

01:42.960 --> 01:49.520
And think about MVCC visibility, think about checking if something is locked or not,

01:49.520 --> 01:54.080
checking the foreign keys. So finally, you may think that you have a write-heavy workload

01:54.080 --> 02:00.120
because you're ingesting a lot of data, but the database is doing a lot of reads behind

02:00.120 --> 02:02.760
that.
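
This hidden read behind a write is easy to see even in an embedded database. A minimal sketch with Python's built-in sqlite3 (not one of the databases discussed in this talk, just an illustration):

```python
import sqlite3

# A "write-only" workload still triggers reads: to enforce the primary key,
# the database must first look up whether the key already exists.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
con.execute("INSERT INTO t VALUES (1, 'first')")

try:
    # This insert forces a read of the primary-key index before writing.
    con.execute("INSERT INTO t VALUES (1, 'duplicate')")
except sqlite3.IntegrityError as e:
    print("rejected after an index read:", e)
```

The second insert is rejected, which is only possible because the engine read the index before writing.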

02:02.760 --> 02:07.840
Something that is sure in databases, you need sorted structures.

02:07.840 --> 02:13.680
You can write to a log, you can write to a heap table, but if you want to query, you want

02:13.680 --> 02:23.440
to query some values, and then you need to get a structure to find those values.

02:23.440 --> 02:30.480
For point queries, the structure can use a hash or be sorted, for range queries, it must be sorted.

02:30.480 --> 02:37.080
So the data must be indexed because you query it, and it must be indexed with a sorted

02:37.080 --> 02:38.280
structure.

02:38.280 --> 02:42.800
And there is an additional constraint in most databases.

02:42.800 --> 02:49.840
You have a lot of data, so this sorted structure, the index, cannot fit entirely in memory.

02:49.840 --> 02:56.120
We need a structure that has some optimizations for the parts in memory, but also for when

02:56.120 --> 02:58.320
it comes from disk.

02:58.320 --> 03:04.440
If I take an example of sorted data, so this data is sorted, each line is sorted, I think

03:04.440 --> 03:11.760
of it like blocks on disk, for example, they are not all sorted, but they are not overlapping,

03:11.760 --> 03:13.440
which is quite good.

03:13.440 --> 03:16.400
With this data, I can get it sorted quite easily.

03:16.400 --> 03:21.760
If I know where to start, I just need a pointer to the block where I need to start.

03:21.760 --> 03:28.440
And if in addition to that, in each block, I have a pointer to the offset of the next block

03:28.440 --> 03:29.840
logically sorted.

03:29.840 --> 03:33.760
I can just follow the pointers and get something sorted.

03:33.760 --> 03:37.960
But this is a lot of pointers to follow if it's on disk.

03:37.960 --> 03:40.240
This is a lot of latency.

03:40.240 --> 03:50.200
So to accelerate those searches, we can put more metadata in memory to get faster to a specific

03:50.200 --> 03:51.680
value.

03:51.680 --> 03:59.560
For example, if I search for my name in that, I can pick a pointer, go to it, I see that

03:59.560 --> 04:05.980
it is after it, then I take another one that is later, it is before it, and with this

04:05.980 --> 04:12.260
dichotomy, I can finally find my name by following fewer pointers just because I had

04:12.260 --> 04:15.300
a bit more information in memory.

04:15.300 --> 04:17.020
We can put even more in memory.

04:17.020 --> 04:21.100
We can also put not only some pointers, but some values.

04:21.100 --> 04:27.060
For example, the skip list, where I can read the list and see that, for everything that starts

04:27.060 --> 04:34.620
after F, I can skip directly to a pointer there and go directly to the block.

04:34.620 --> 04:37.580
So that's really effective.

04:37.580 --> 04:42.900
Having a structure that is optimized in memory, like a skip list, and having a structure

04:42.900 --> 04:50.140
on disk where I don't have to follow too many offsets in the file, too many seeks in the

04:50.140 --> 04:51.980
files.
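
This idea of keeping a small sorted set of keys in memory to jump straight into sorted blocks on disk can be sketched like this (a toy example with made-up data, not any real database's format):

```python
import bisect

# Sorted blocks standing in for on-disk data; the sparse index in memory
# holds only the first key of each block, like the metadata described above.
blocks = [["alpha", "bravo"], ["charlie", "delta"],
          ["echo", "golf"], ["hotel", "india"]]
first_keys = [b[0] for b in blocks]   # small enough to keep in memory

def point_query(key):
    # One in-memory binary search replaces following many on-disk pointers:
    # it tells us the single block that could contain the key.
    i = bisect.bisect_right(first_keys, key) - 1
    return i >= 0 and key in blocks[i]

print(point_query("echo"))     # True: found in the third block
print(point_query("foxtrot"))  # False: only one block had to be checked
```

The search touches exactly one "block", however many blocks exist, which is the point of putting that bit more information in memory.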

04:51.980 --> 04:58.280
And we can go further with a B-tree, because the problem is that at some point the metadata

04:58.280 --> 05:05.240
that you put in memory to be efficient, if you have a lot of data, may not fit entirely

05:05.240 --> 05:06.240
in memory.

05:06.240 --> 05:13.240
So the idea of a B-tree is to have multiple levels of metadata before going to the final

05:13.240 --> 05:16.960
block that holds your data.

05:16.960 --> 05:25.120
This is a B-tree where, at the top, I have root information with some values that will tell

05:25.120 --> 05:31.360
me where I can find more detailed information about the ranges and if I search again

05:31.360 --> 05:37.600
for a value there, I start at the root, there is only one root, I know where to start.

05:37.600 --> 05:45.720
And then I see that, for F, I may have more information in another block at a lower level.

05:45.720 --> 05:52.880
And finally, after some levels, I can go directly to the block that holds my value.

05:52.880 --> 05:56.400
So this is quite efficient to read.

05:56.400 --> 06:05.360
The idea of the B-tree, the B+-tree, is that the data is only in the leaf nodes, and the

06:05.360 --> 06:13.480
internal nodes, the root and the branches, just help to go to the right leaf.

06:13.480 --> 06:22.440
And initially the idea of a B-tree was that the same format was used for the branches,

06:22.480 --> 06:28.040
the internal nodes, and the leaves, and it's just those that are read most often that stay

06:28.040 --> 06:32.520
in memory, like probably the root will be in memory.

06:32.520 --> 06:37.960
But there are some B-trees with more optimizations when it is in memory.
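
A toy sketch of such a B+-tree point lookup, with values only in the leaves (an assumed, simplified layout, nothing like WiredTiger's real node format):

```python
import bisect

# Minimal B+-tree sketch: internal nodes hold separator keys and child
# pointers; only leaves hold values.
class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # separators (internal) or stored keys (leaf)
        self.children = children  # child nodes; None marks a leaf
        self.values = values      # values, present only in leaves

def search(node, key):
    # Descend from the root: each level narrows the range until a leaf.
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

leaf1 = Node(["a", "c"], values=[1, 2])
leaf2 = Node(["f", "h"], values=[3, 4])
root = Node(["f"], children=[leaf1, leaf2])   # one separator, two leaves
print(search(root, "f"))  # 3, found after descending one level
```

With one root and a few levels, the lookup reads one block per level, which is why the read path is short as long as the upper levels stay in memory.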

06:37.960 --> 06:40.600
But so far, this is a point query.

06:40.600 --> 06:45.680
If I want to do a range query, I need some more information.

06:45.680 --> 06:51.760
For example, like we have seen before, a pointer to the next one so that I can just follow

06:51.800 --> 06:56.000
to the next one to do my range scan.

06:56.000 --> 07:03.640
Lots of B-trees use that, but on writes, it means that those pointers must be updated.

07:03.640 --> 07:06.360
And when you update, you need to lock.

07:06.360 --> 07:13.160
There is an alternative: for example, WiredTiger B-trees do not store links between the

07:13.160 --> 07:17.600
leaves to scan horizontally.

07:17.600 --> 07:26.160
They go back to the branches to find the next block, which is more efficient on writes

07:26.160 --> 07:32.320
because you don't have to update those pointers, less efficient on reads, but the optimization

07:32.320 --> 07:39.120
on reads there is that WiredTiger doesn't really put the file offsets there.

07:39.120 --> 07:46.160
It uses real virtual memory addresses when the internal nodes are in memory.

07:46.160 --> 07:52.840
So, more optimizations for the branches, the internal nodes, and the navigation, to avoid

07:52.840 --> 07:57.400
writing too much on the others.

07:57.400 --> 08:03.560
The big problem with B-trees is that everything has a place, and this is why it is easy to find,

08:03.560 --> 08:08.960
but if everything has a place, an insert must go to a specific block and when this

08:08.960 --> 08:15.200
block is full, then we need to make some room in it, like splitting it, and the big problem

08:15.200 --> 08:18.320
is that we have multiple pointers to update.

08:18.320 --> 08:23.760
So you see, for example, that this one is the pointer that we want to avoid updating in

08:23.760 --> 08:29.120
WiredTiger; in traditional B-trees, it is updated there.

08:29.120 --> 08:37.760
So, to go faster on writes, because this sometimes involves a block split for an insertion,

08:37.760 --> 08:44.640
to go faster, the idea of the LSM tree is that there is not this constraint that a value

08:44.640 --> 08:52.320
is at a specific place, you see multiple, it's not blocks, it's larger, it's multiple files,

08:52.320 --> 08:59.400
that are sorted inside them, but they have overlapping values and it's not only files,

08:59.400 --> 09:06.840
there is also one in memory, which means that if I'm looking for a specific value, I need

09:06.920 --> 09:13.640
to read from multiple places, so you see how it is less efficient for reads, because you need

09:13.640 --> 09:21.400
to read from multiple places. But again, this can be optimized: for example, in RocksDB and

09:21.400 --> 09:27.240
in many LSM trees, you have Bloom filters that can just tell you: for this value, no need to look

09:27.240 --> 09:33.800
at this file, it's not in this file. There are also, for example in RocksDB, at least in the

09:33.880 --> 09:40.600
YugabyteDB improvements on RocksDB, the minimum value and maximum value of the key columns

09:40.600 --> 09:46.920
are stored, so that maybe you can just skip some files, so everything can be faster, but finally,

09:46.920 --> 09:53.240
it has to read from multiple places and do a merge, a sort-merge, to preserve

09:53.240 --> 10:00.840
the order there, but when you write, it's easier, because there is no specific place, you can

10:00.920 --> 10:05.560
write where you want, so you choose the fastest: you write in the memory table, the memtable, that is

10:05.560 --> 10:11.240
on top of it, and of course you limit it to a specific size, so when it is full,

10:11.240 --> 10:18.680
the memory table will be flushed to a file and you can use again another memory table for that,

10:19.400 --> 10:26.040
so it's flushed in the background to more files, which increases the number of files;

10:26.040 --> 10:29.560
you don't want to increase the number of files too much, because when you read it,

10:29.560 --> 10:35.640
you have to read a lot of files, so there is a background compaction that happens that reads a few

10:35.640 --> 10:46.120
files, a few small files, to build a larger file and have fewer to read. But again, this is additional

10:46.120 --> 10:53.000
work. But again, there are some optimizations: for example, in Postgres, anyway, you have vacuum

10:53.000 --> 11:00.600
that has to read data, so this can be offloaded to the compaction, so the compaction is additional

11:00.600 --> 11:06.440
work, but can also do some work that the database has to do, maybe you can gather statistics

11:06.440 --> 11:13.640
during that; you can do a lot of things that the database must do in the background. So the compaction

11:13.720 --> 11:23.640
runs there to have fewer files. Okay, so to summarize, when we compare the big differences,

11:24.680 --> 11:30.440
the B-tree has a specific place for each value, where the LSM tree may have multiple places,

11:30.440 --> 11:37.000
so the B-tree is maintained sorted on writes, because it is written at a specific place, where the

11:37.000 --> 11:42.600
LSM tree is partially maintained sorted, because you insert into the memory table and it can be

11:42.600 --> 11:49.960
sorted in memory easily, but you need compaction and you need the sort-merge when you read it

11:51.400 --> 11:58.120
at the end. So this is why we say that B-trees are faster for reads and LSM trees are faster for

11:58.200 --> 12:08.040
writes, but there are many optimizations. A B-tree can be efficient on writes if a lot of branches

12:08.040 --> 12:13.640
are in memory; if you have to follow the tree through all its levels on disk, that's

12:13.640 --> 12:21.320
too much. So usually, a B-tree needs memory; for LSM trees, you may allocate less memory.

12:22.280 --> 12:28.680
Then we have the space amplification, depending on what you write: if you write random values,

12:29.320 --> 12:38.520
they will go to random blocks, and then those will be full, or split and then half full, so on

12:38.520 --> 12:46.440
average, you have 25% of free space there. But again, you can optimize it: first, all the

12:46.520 --> 12:54.680
databases that use B-trees have some ways to compact or rebuild the index online,

12:56.040 --> 13:04.200
and also, on the application side, you try to avoid random values, like a random UUID for a primary

13:04.200 --> 13:11.160
key, if you don't want to get this space amplification. Space amplification is not only about

13:11.160 --> 13:19.000
the space; it means that it will also use more memory to keep things in cache. And for the

13:19.000 --> 13:24.440
write amplification, that happens when blocks are split, again, there are many optimizations in

13:24.440 --> 13:30.360
WiredTiger, for example; there are many optimizations where it doesn't update a leaf in place,

13:30.360 --> 13:38.280
it does some kind of copy-on-write for that, and it avoids having pointers between blocks, also

13:39.080 --> 13:46.920
to avoid locks or latches when it reads the structure. And the LSM trees also have their

13:47.800 --> 13:54.680
own optimizations, as I mentioned: for the read amplification, Bloom filters, min/max values. Also,

13:54.680 --> 14:03.240
YugabyteDB decided to use only one level of the LSM tree to optimize the reads and the

14:03.240 --> 14:10.440
compaction. Space amplification for LSM trees comes, for example, when it has to compact:

14:11.000 --> 14:18.040
you need double space for it, so for one big database you don't want to compact all files,

14:18.680 --> 14:24.200
but sometimes you need to compact all files, because maybe you inserted a very long time ago,

14:24.200 --> 14:29.160
and you deleted recently, so you inserted a tombstone that is in a different file,

14:30.120 --> 14:40.040
But if you do sharding, like YugabyteDB or TiDB, which shard before storing in the LSM tree,

14:40.040 --> 14:45.080
if you do sharding, you control the size of it, and then, for the compaction, you can do a full

14:45.080 --> 14:54.760
compaction on an amount of data that is limited. For the write amplification, that is low when you

14:54.840 --> 15:02.680
write to the memory table, but the compaction needs to keep up with that, because you cannot

15:02.680 --> 15:10.280
allocate as many memory tables as you want if they are not flushed and compacted efficiently behind

15:10.280 --> 15:17.160
it. But again, there are a lot of things that can be offloaded to the compaction to optimize it.

15:18.120 --> 15:26.680
If I compare more specifically, but I think I will go faster on that, the optimizations that

15:26.680 --> 15:35.160
were made on the WiredTiger B-tree, so I mentioned a few of them; the lock-free algorithms are quite nice,

15:35.960 --> 15:42.040
it was really built for databases with multiple threads updating data,

15:43.000 --> 15:50.280
and also the usage of it: for example, WiredTiger is used in MongoDB, where

15:50.280 --> 15:55.560
it doesn't store directly in the primary key index,

15:57.080 --> 16:02.600
instead of using the primary key from the application, it generates a sequence that is increasing,

16:02.600 --> 16:08.920
so you don't have this space amplification. So the way you use it is also a place where

16:08.920 --> 16:15.160
it is optimized. And about the benefits you get from LSM trees, I already mentioned a few of them, but for example,

16:15.800 --> 16:22.200
they really suppose SSDs, because of the read amplification: if you add the latency of

16:22.200 --> 16:29.800
a mechanical disk, then it gets complicated, but on modern storage that works. And something that is

16:29.800 --> 16:37.640
really nice for databases on LSM trees is that you have incremental backups coming from it,

16:37.640 --> 16:42.920
because the files, the SST files, are immutable once they are flushed, so your incremental

16:42.920 --> 16:47.160
backup, your snapshot, is just keeping the files, that's quite nice.
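
Putting the LSM pieces together, here is a toy sketch (with assumed simplifications everywhere: a dict standing in for the memtable, sorted lists standing in for SST files, and min/max pruning instead of full Bloom filters):

```python
import bisect

# Toy LSM tree: writes go to the memtable; when it is full, it is flushed
# as an immutable sorted "SST file"; reads check the memtable first, then
# files from newest to oldest, skipping files whose min/max key range
# cannot contain the key (the file-pruning mentioned in the talk).
memtable, sstables = {}, []   # sstables: newest last, each a sorted (key, value) list
MEMTABLE_LIMIT = 2

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:          # full: flush in sorted order
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):             # the newest file wins
        if table[0][0] <= key <= table[-1][0]:   # min/max pruning
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
    return None

put("b", 1); put("a", 2)     # second put fills the memtable: flushed to a file
put("b", 3)                  # newer value for "b", still in the memtable
print(get("b"), get("a"))    # 3 2: one read from memory, one from a "file"
```

Note how a write never touches the files, while a read may have to look in several places, which is the asymmetry the whole talk is about. A real compaction would merge the sorted files, dropping the older duplicates.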

16:49.000 --> 16:58.040
Something also to keep in mind about the storage: the mechanical disks used to be slow

16:58.040 --> 17:04.920
for everything random, reads and writes; this is why we try to do as much as possible in memory,

17:05.000 --> 17:10.440
because the random access is faster there, and to do things more sequentially when writing to the WAL

17:10.440 --> 17:17.400
or at checkpoint. With SSDs, the reads are faster, so you can accept a bit more read

17:17.400 --> 17:25.160
amplification, but the writes can still be slow, so it's something that is optimized,

17:25.160 --> 17:32.600
by transforming the random writes to sequential writes, because it writes to memory, and then the

17:32.600 --> 17:41.080
flush is sequential, the compaction is sequential. So I started with this, because I wanted to be sure

17:42.760 --> 17:51.480
that I mentioned it: for write-intensive workloads, with the LSM tree, the writes are faster,

17:51.480 --> 17:57.240
because you write directly to the memtable and the compaction can keep up behind it; you write

17:57.240 --> 18:04.360
only to the memtable. Whereas for B-trees, you have to write to the cache, maybe in memory,

18:04.360 --> 18:10.600
but you may need to read some blocks, the blocks that you need, the block where the value goes,

18:10.600 --> 18:21.000
so that's additional work. But not all workloads are write-intensive, and LSM trees were really made

18:21.000 --> 18:26.360
for big data, where you upsert, where you don't have secondary indexes, for things like

18:26.360 --> 18:32.120
Cassandra, for example. But in OLTP, in SQL databases, as I mentioned at the beginning,

18:32.120 --> 18:38.120
you read a lot, you need to read a lot even when you insert, which means that with the B-tree,

18:39.400 --> 18:46.520
you already have this read done, so this additional thing that you do when you write,

18:46.520 --> 18:52.440
you needed it for the read anyway, whereas in LSM trees the reads go through a completely different path,

18:52.520 --> 18:59.240
and even a different cache, so there is an overhead there. So you need to think about it; it's not as easy

18:59.240 --> 19:08.440
as: I do a lot of writes, so I will pick a database with LSM trees. Those are examples of

19:08.440 --> 19:14.520
databases using B-trees or LSM trees, just to show that it's not only the old databases

19:14.520 --> 19:21.240
that use B-trees; for example, AWS has all kinds of databases, and most of them use B-trees,

19:22.200 --> 19:27.960
and on the LSM tree side, of course, you see the big data databases, but you also see SQL databases.

19:29.880 --> 19:35.320
so finally, so this is my LinkedIn, if you have any question later, the main message is that

19:35.320 --> 19:41.240
yeah, LSM trees are faster for writes, but your database is probably doing more reads than writes,

19:41.240 --> 19:47.320
even if you think that your application is writing a lot. LSM trees are very complex to tune;

19:47.400 --> 19:53.640
there are a lot of parameters to control, like whether the compaction is aggressive or not,

19:55.560 --> 20:02.920
so they can be complex to tune, whereas B-trees may look simpler, but be careful with what you do,

20:02.920 --> 20:10.600
be careful with the random keys, for example, and there are different implementations,

20:10.760 --> 20:17.800
I was really surprised by the WiredTiger implementation, because it has a lot of

20:17.800 --> 20:23.960
optimizations compared to older databases; it was really made for a time where multiple

20:23.960 --> 20:29.800
threads access the same data. And the most important is to understand how it works and understand

20:29.800 --> 20:35.480
the trade-offs; there is not one that is better than another. But the risk is, if you use one

20:35.480 --> 20:42.840
without really understanding what happens, you may be misled. Okay, thank you.

