WEBVTT

00:00.000 --> 00:11.840
So, welcome everyone. It's quite a sizable audience for a database benchmark talk on a Sunday

00:11.840 --> 00:17.360
afternoon. My name is Gábor Szárnyas and in the following 30 minutes I will talk about database

00:17.360 --> 00:22.280
benchmarks and what I learned about them while running a benchmark standards organization.

00:22.280 --> 00:28.440
A little bit about me. I spent a decade in academia, then I was building up a benchmark

00:28.440 --> 00:33.800
non-profit called the LDBC and now I work at a database vendor. An overarching theme of my

00:33.800 --> 00:40.680
career is doing database benchmarks and using them. Today I'm here wearing my benchmark non-profit

00:40.680 --> 00:47.720
hat and I will try to share the lessons that I learned in that role. The talk has five parts. I will

00:47.720 --> 00:52.920
talk about the need for database benchmarks. Then I will talk about two benchmark organizations,

00:53.000 --> 00:59.480
the TPC and the LDBC. Then I will review some popular benchmarks in modern times and I will

00:59.480 --> 01:05.160
leave you with a bunch of takeaways. So, let's get started. Why would you benchmark databases in the

01:05.160 --> 01:10.840
first place? It turns out that databases and benchmarks are a pretty good match. First of all,

01:10.840 --> 01:16.440
users usually care about performance. They spend a lot of time optimizing their workloads and if

01:16.440 --> 01:21.480
the database can do this for them or if the database allows them to go further, they are quite

01:21.560 --> 01:26.680
happy about that. So, the vendors are incentivized to have good performance or at least to claim

01:26.680 --> 01:34.840
good performance. Databases are also mostly deterministic. Benchmarking the database seems reasonably

01:34.840 --> 01:39.960
straightforward because you usually get the same query plan and a similar execution time.

01:39.960 --> 01:45.000
So, it's not like benchmarking a large language model, which is a lot more stochastic than a database.

01:46.040 --> 01:50.360
So, at this point you may think that you take a data set, define a bunch of queries,

01:50.360 --> 01:56.040
throw them at the database and then write down the numbers for the database's performance

01:56.040 --> 02:03.400
and this will somehow magically make database benchmarks work. Unfortunately, reality is a bit more

02:03.400 --> 02:09.160
complicated than this. If you just do data set and queries, you will not really stress

02:09.720 --> 02:15.080
updates in the database which is a crucial part of any database system. If you just run the same

02:15.080 --> 02:19.880
queries over and over again, you get a limited understanding of how the optimizer works.

02:19.880 --> 02:25.240
So, you should also plug in different parameters into those queries and you need to somehow

02:25.240 --> 02:31.480
tie all this together into a workload definition. The database is of course a piece of software that

02:31.480 --> 02:36.120
doesn't exist in isolation. You have all these layers under the database, the file system,

02:36.120 --> 02:43.560
the operating system, the hardware components, maybe a container, an emulator, some virtualization

02:44.360 --> 02:48.760
layer, these all have interplays with each other that can have an effect on your results.

02:49.480 --> 02:55.400
And finally, performance is important, but correctness is also important. You have to verify that

02:55.400 --> 03:00.600
the results are correct and usually pricing is something that users deeply care about, so you need

03:00.600 --> 03:08.840
to assess the pricing of the system. And speaking of being deterministic, yes, databases are mostly

03:08.920 --> 03:13.960
deterministic, but reproducibility is still difficult. It's quite difficult to hit

03:13.960 --> 03:21.080
reliably the same code path in a database system. And you have all these extremes of different runs,

03:21.080 --> 03:26.280
cold runs, which are incredibly difficult to do because you basically have to restart the computer,

03:26.280 --> 03:32.360
restart the database and then run the query to get a true cold run. And hot runs, which are

03:32.360 --> 03:37.960
easier to achieve because you just run the same thing in a loop, but maybe they are not that relevant.
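The cold-versus-hot distinction is easy to see even on a toy setup. Here is a minimal sketch (an editor's illustration, not from the talk) using Python's built-in sqlite3: the first run pays one-time costs, while repeated runs in a loop are "hot". As noted above, a true cold run would require restarting the database and flushing OS caches, which this sketch does not attempt.

```python
import sqlite3
import statistics
import time

# Build a small in-memory table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

def run_query():
    # Time a single execution of an aggregation query.
    start = time.perf_counter()
    conn.execute("SELECT sum(x) FROM t").fetchone()
    return time.perf_counter() - start

first = run_query()                      # a "lukewarm" first run at best
hot = [run_query() for _ in range(20)]   # hot runs: the same thing in a loop
print(f"first: {first:.6f}s, hot median: {statistics.median(hot):.6f}s")
```

The gap between the first and the median hot run is exactly the kind of variance the talk warns about.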

03:38.040 --> 03:43.480
So when you are measuring systems, you are kind of in between, lukewarm and warm runs in a

03:43.480 --> 03:51.400
system, and there can be orders of magnitude differences between these runs. Another surprisingly

03:51.400 --> 03:56.600
difficult thing is that once you have done your benchmarking and you maybe have 100 million

03:56.600 --> 04:01.960
little entries in your benchmark report, you would need to write them up in some sort of a summary

04:02.600 --> 04:07.160
aggregate statistics. And there is an excellent paper from Torsten Hoefler and Roberto Belli,

04:07.160 --> 04:13.160
where they explain how to report these for a parallel computing system, which is of course

04:13.160 --> 04:19.160
every computing system these days. And I cherry-picked two rules for you. Rule three says that

04:19.160 --> 04:25.480
you should only use the arithmetic mean, the average, for summarizing costs. If you are summarizing

04:25.480 --> 04:30.760
rates, like transactions per second, you should use the harmonic mean. And if you are summarizing

04:30.840 --> 04:36.600
ratios, first of all you shouldn't, you should go back and dig out the original values from

04:36.600 --> 04:43.160
the base data and then summarize that. But if you only have these ratios or speed-ups available,

04:43.160 --> 04:48.600
then you should use the geometric mean. So we basically use all the Pythagorean means just to report

04:48.600 --> 04:57.000
on some sort of an average in terms of costs and execution. Another thing is what to measure.
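The three Pythagorean means just mentioned are all in Python's standard library, so the reporting rules can be sketched directly (the benchmark numbers below are invented):

```python
import statistics

costs = [12.0, 8.0, 10.0]      # absolute costs, e.g. seconds: arithmetic mean
rates = [100.0, 200.0, 400.0]  # rates, e.g. transactions/sec: harmonic mean
speedups = [2.0, 8.0]          # ratios, only if the base data is lost: geometric mean

cost_summary = statistics.mean(costs)                # -> 10.0
rate_summary = statistics.harmonic_mean(rates)
ratio_summary = statistics.geometric_mean(speedups)
```

Note how the harmonic mean of the rates is well below their arithmetic mean: that is the point of the rule, since averaging rates arithmetically overstates throughput.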

04:57.000 --> 05:02.440
Microbenchmarks are quite popular in research. You measure one join operator or maybe just a simple

05:02.440 --> 05:07.480
hash probe. This is very interesting if you are implementing that operator, but for users,

05:07.480 --> 05:13.080
it's more interesting to have macrobenchmarks, load the data and run a bunch of queries, or even

05:13.080 --> 05:19.480
application-level benchmarks that measure the database system's performance end-to-end with complex

05:19.480 --> 05:27.960
updates, which make for a more intertwined workload. So, this is all quite difficult even if you

05:27.960 --> 05:32.680
have the best intentions, but let me tell you in the 1980s, database vendors did not have the

05:32.680 --> 05:38.760
best intentions. In this period, relational databases were still kind of immature and had major

05:38.760 --> 05:44.520
performance problems and something called benchmarketing was quite common. This means that benchmarks

05:44.520 --> 05:50.040
were misused; vendors were even implementing their own benchmarks and, unsurprisingly, they won on

05:50.040 --> 05:55.240
their own benchmarks and they boasted about these results. An interesting chapter in this story

05:55.240 --> 06:00.760
is 1982, when professor David DeWitt measured a bunch of database systems, including Oracle.

06:01.480 --> 06:07.880
He published those results and that didn't really make Oracle happy, so they put a clause into

06:07.880 --> 06:13.880
their license saying, you are not allowed to publish results unless they are officially sanctioned

06:13.880 --> 06:20.440
by the company, and this practice became so widespread that such a clause is now called a DeWitt clause, based on this little event.

06:21.640 --> 06:27.640
But by the end of the 1980s, it was kind of clear that everyone doing their own benchmark is not

06:27.640 --> 06:33.880
really doing anyone any good, and there was a clear need for an independent authority and benchmark standards,

06:33.880 --> 06:40.840
and thus the TPC was born. The TPC, somewhat confusingly stands for Transaction Processing Performance

06:41.720 --> 06:47.640
Council. This is a non-profit company that is based in the United States, and it was founded in 1988

06:47.640 --> 06:54.360
with the mission to make standards for benchmarks and then to make official results for those standards

06:54.360 --> 07:00.920
available. TPC's membership has fluctuated over the years, at its peak it was around 40 members,

07:00.920 --> 07:07.480
now it's at 21 members, and you can see it's heavily dominated by companies based in the United States

07:07.480 --> 07:13.400
and in China. With regards to the type of companies that are members, that's also worth mentioning,

07:13.400 --> 07:19.080
you have the hardware vendors like NVIDIA, AMD, and Intel, you have the server-building companies

07:19.080 --> 07:24.680
like Lenovo, Dell, and HPE, and you have cloud vendors, you have database vendors, so it's kind of an

07:24.680 --> 07:31.640
interesting mix of companies. And what did they do? I picked TPC-H because this is their most

07:31.720 --> 07:38.840
influential and most popular benchmark as a good example. TPC-H was released in 1999, and it

07:38.840 --> 07:44.360
captures an ad hoc analytical workload for a wholesale supplier. If you look at the schema,

07:44.360 --> 07:50.280
you have suppliers, you have parts, you have line items for the entities that you are selling,

07:50.280 --> 07:56.840
and you have 22 queries that do analytics over these tables. It also has some refresh operators

07:56.840 --> 08:04.760
for inserting new data and deleting old data. The pillar of TPC-H is the data generator, which

08:04.760 --> 08:11.000
can create data sets in different scale factors. A scale factor is defined as the size of the data

08:11.000 --> 08:17.960
set in gigabytes, so when we say scale factor 100, we mean it's 100 GB of data. This is actually a

08:17.960 --> 08:23.240
really good way to measure it, because CSV files are kind of timeless, it's really easy to understand

08:23.240 --> 08:29.000
that it's a plain text serialization, and even though these days we are mostly using binary formats,

08:29.000 --> 08:34.920
this is still very easy to understand. The original data generator was written more than 25

08:34.920 --> 08:40.760
years ago, so it's written in C and it's not thread-safe, but luckily a bunch of people last year

08:40.760 --> 08:47.320
rewrote it in Rust, and you can now just pip install it and run it basically anywhere through Python.

08:47.320 --> 08:51.000
And this is a really neat piece of software, and I think this is worth mentioning, because if you

08:51.000 --> 08:57.000
ever need like 100 GB of data at your disposal on your laptop, you can just run these two commands,

08:57.000 --> 09:02.520
and it produces a somewhat meaningful data set in less than a minute, so this is interesting even

09:02.520 --> 09:10.040
if you're not into database benchmarking per se. The queries of TPC-H are obviously quite involved,

09:10.040 --> 09:14.840
I don't have time to go through all of them, but here is a really interesting example TPC-H

09:14.840 --> 09:21.800
query 1. This is a rather simple query. It uses a single table, lineitem, it uses some simple

09:21.800 --> 09:27.640
filtering, and it uses a bunch of simple aggregates like sum, average, and count, and I like this

09:27.640 --> 09:34.360
query because this is something that realistically every analytical database should be good at.
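To make the shape of query 1 concrete, here is a rough Python rendering of what it computes: filter on ship date, group by return flag and line status, then aggregate. The rows and the cutoff date are toy values, not the official query or data.

```python
from collections import defaultdict
from datetime import date

# Toy lineitem rows: (returnflag, linestatus, quantity, extendedprice, shipdate).
rows = [
    ("A", "F", 17, 1000.0, date(1998, 8, 1)),
    ("A", "F", 3, 200.0, date(1998, 8, 15)),
    ("N", "O", 10, 500.0, date(1998, 11, 1)),  # ships after the cutoff
]

def q1_like(rows, cutoff=date(1998, 9, 2)):
    # Filter on shipdate, then group and aggregate, as TPC-H query 1 does.
    groups = defaultdict(lambda: {"sum_qty": 0, "sum_price": 0.0, "count": 0})
    for flag, status, qty, price, shipdate in rows:
        if shipdate <= cutoff:
            g = groups[(flag, status)]
            g["sum_qty"] += qty
            g["sum_price"] += price
            g["count"] += 1
    for g in groups.values():
        g["avg_qty"] = g["sum_qty"] / g["count"]
    return dict(groups)
```

A scan, a filter, a group-by, and a handful of aggregates: nothing a young analytical engine cannot attempt on day one, which is exactly the appeal.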

09:34.360 --> 09:41.240
It doesn't have anything complex, you can implement it as a benchmark query in maybe a few minutes,

09:41.320 --> 09:45.720
and if you just start out building an analytical database system, you can start benchmarking

09:45.720 --> 09:52.360
it with this query quite early on in your journey of building the system. The way TPC benchmarks

09:52.360 --> 10:00.120
work is that they have three distinct phases. You first load the data into your database, then you do

10:00.120 --> 10:04.840
something called a power test. The power test means that you take the 22 queries and the refresh

10:04.840 --> 10:10.280
operators and run them sequentially. This allows you to measure whether the database can do

10:10.280 --> 10:16.920
intra-query parallelism, whether it can exploit multiple cores for the same query. Then you compute

10:16.920 --> 10:23.000
the geometric mean run time of these queries and you get a power score. Next up is the throughput test.

10:23.000 --> 10:29.480
This is concurrent, you run queries and updates concurrently and try to achieve maximum throughput.

10:29.480 --> 10:34.840
And then finally, these are put together in a composite metric by taking a geometric mean of the two.
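In code, the scoring could be sketched roughly like this. This is a simplification: the official Power and Throughput formulas also account for refresh functions and query streams, so the constants here are only indicative.

```python
import math

def power_score(runtimes_s, scale_factor):
    # Power test: based on the geometric mean of the sequential run times.
    geo_mean = math.prod(runtimes_s) ** (1 / len(runtimes_s))
    return 3600 * scale_factor / geo_mean

def composite_score(power, throughput):
    # The composite metric is the geometric mean of the two scores.
    return math.sqrt(power * throughput)
```

Using the geometric mean for the composite rewards systems that are balanced: a huge power score cannot fully mask a weak throughput score, and vice versa.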

10:34.840 --> 10:41.800
And this weirdly named metric, QphH, is something that we can use to track the performance of

10:41.800 --> 10:48.200
analytical workloads. I dug this out of the official TPC results and it actually shows a really nice

10:48.200 --> 10:54.840
trendline starting from 2001 until 2020. And what you can see is that basically the trendline

10:54.920 --> 11:04.600
starts around 5,000 QphH points. And now it's above 5 million. So this is a thousand-x, or 41%

11:04.600 --> 11:10.360
year-on-year improvement in terms of analytical performance that is at your disposal when you buy

11:10.360 --> 11:18.200
a state-of-the-art database system. TPC-H has been so influential that people tried to do some reverse

11:18.200 --> 11:23.320
engineering on it and break down, first of all, why it is such an effective benchmark.

11:23.320 --> 11:29.320
And second of all, how you should excel on it if you're a database vendor. So if you read this paper,

11:29.320 --> 11:34.680
it defines these choke points, these kind of well-defined bits of difficulty that the benchmark

11:34.680 --> 11:41.800
encodes. And you can see how improvements on those choke points would improve your results on TPC-H.

11:41.800 --> 11:47.320
So for example, if you improve aggregation, it will have an effect on all of the queries

11:47.320 --> 11:52.280
and an outsized effect on some of the queries. If you improve join performance,

11:52.280 --> 11:57.480
you will not impact queries 1 and 6, but you will deeply impact queries 9 and 18.

11:59.480 --> 12:05.160
Another benchmark from the TPC is TPC-DS, and I bring this up more as a cautionary tale.

12:05.160 --> 12:11.080
This is a more modern and improved version of TPC-H, so it's for analytics. It models a retail

12:11.080 --> 12:17.800
product supplier with multiple sales channels, it has dimension tables, fact tables and it's more

12:17.960 --> 12:24.920
realistic than TPC-H in general. The problem with TPC-DS is that it has 99 queries and, on average,

12:24.920 --> 12:31.960
there are almost 2,000 characters per query. So it's almost 200,000 characters just for the query definitions.

12:32.520 --> 12:37.800
The deletes are also more complex. You have these cascading deletes from one table to another.

12:37.800 --> 12:42.920
If you delete something from sales, it will cascade into the returns fact table, so they are more

12:43.000 --> 12:49.080
difficult to implement efficiently. As a result, even though TPC-DS was released in 2011,

12:49.800 --> 12:55.720
it took seven years for someone to undertake the first audit. It was done by Cisco and it was a

12:55.720 --> 13:04.840
scale factor 10,000 result. In the years since, there was a scale factor 100,000 result by Alibaba,

13:04.840 --> 13:12.040
and in 2021 there was a record-breaking result on TPC-DS by Databricks on the 100,000 scale

13:12.120 --> 13:18.200
factor. This actually triggered a subsequent benchmark war with Snowflake, with Databricks and Snowflake

13:18.200 --> 13:25.000
putting out press releases, blog posts and social media on why their system is better, more efficient,

13:25.000 --> 13:29.800
and better in terms of price-performance. This is a really interesting story, and I'm also

13:29.800 --> 13:37.880
happy to chat about this after the talk. So I said that TPC acts as an independent authority and

13:37.880 --> 13:43.480
they have official results. The way they achieve this is that they have an auditing process that you

13:43.480 --> 13:49.080
have to undertake. If you want to audit your database, then you become a test sponsor and you commission

13:49.080 --> 13:54.040
this audit from the TPC. You have to run the experiment yourself and you have to write it up

13:54.040 --> 13:59.800
in the full disclosure report yourself, and then you hand it over to a certified TPC auditor

13:59.800 --> 14:06.360
who validates the result. They have pretty broad powers: they can not only rerun your experiments,

14:06.360 --> 14:11.640
but they can do additional checks. Back in the day, they could fly to your server farm and

14:11.640 --> 14:16.520
pull the server's plug out of the socket to see whether the durability of the system is really there.

14:17.720 --> 14:24.120
And they do all sorts of these checks to make sure that there are no benchmark specials in your

14:24.120 --> 14:30.840
system. What is a benchmark special? Well, vendors are incentivized to do well on TPC benchmarks,

14:30.840 --> 14:37.080
or sometimes they build in operators just to do a TPC query well. And the way the auditors are

14:37.080 --> 14:43.080
trying to catch this is, for example, by garbling some of the queries. So this is query 1.
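A sketch of what such garbling might look like (an editor's illustration with a made-up identifier list, not the auditor's actual tool):

```python
import random
import re

def garble_query(sql, seed=0):
    # Replace well-known table and column names with opaque random aliases,
    # so plans hard-coded against the familiar names no longer match.
    identifiers = ["lineitem", "l_returnflag", "l_linestatus", "l_quantity"]
    rng = random.Random(seed)
    mapping = {}
    for name in identifiers:
        alias = "x" + "".join(rng.choices("abcdefghij", k=8))
        mapping[name] = alias
        sql = re.sub(rf"\b{name}\b", alias, sql)
    return sql, mapping
```

A semantically equivalent schema is generated with the same renaming, so the query still runs, but any string match against "lineitem" in the engine comes up empty.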

14:43.080 --> 14:49.080
And this is query 1 with all the column names and all the table names replaced with some random

14:49.080 --> 14:55.400
encoding. This prevents systems from using hard-coded, highly optimized operators and query plans

14:55.400 --> 15:03.240
if they use this method of detecting which TPC query is running. An important part of TPC

15:03.240 --> 15:10.360
benchmarks is pricing. TPC pricing is a specification that's almost 70 pages and it captures

15:10.360 --> 15:16.360
how you should price the total cost of ownership for a given database system. And this has three major

15:16.360 --> 15:21.640
components. The first is a software license, which can be zero if it's free and open source.

15:22.040 --> 15:29.480
The three year cost for the hardware and the cloud service if you use a cloud vendor. And then you

15:29.480 --> 15:35.640
have a three year maintenance with enterprise great support. Now I personally have two main criticism

15:35.640 --> 15:41.080
with this pricing. The first of all is that I believe the enterprise support is very strict.

15:41.080 --> 15:47.480
It's quite common, especially in analytics to run database systems without 24-7 enterprise support.

15:47.480 --> 15:53.480
So more of a next business this day support would be sufficient and it would keep the prices

15:53.480 --> 15:59.720
more realistic. But my bigger problem is that this is really an outdated way of thinking about

15:59.720 --> 16:08.040
databases. People who run databases in the cloud where the storage and compute are disaggregated

16:08.040 --> 16:13.240
and you can elastically scale the compute depending on your needs. Don't rent out computers for

16:13.240 --> 16:19.320
a fixed three year period. And you can actually see the effect of this in the pricing that

16:19.320 --> 16:24.360
pricing's that are reported. For example, the Databricks audit that I have mentioned from five

16:24.360 --> 16:32.120
years ago, they run it on 256 machines and because of this, the TPC pricing specification just

16:32.120 --> 16:38.040
assumes that you would run this out for a three year period and this yields a half to some of

16:38.120 --> 16:43.960
five point two million dollars. This is obviously not realistic number for doing data processing

16:43.960 --> 16:51.400
on 100 terabytes. 100 terabytes is not that much in these days. One interesting thing that I noticed

16:51.400 --> 16:59.000
is that I put it out the TPC number of audits in the last 15 years or so and they have a moderate

16:59.720 --> 17:06.680
negative correlation with the US federal funds interest rates. This is because the closer we are to

17:06.680 --> 17:11.560
zero interest rates, the more money is available for venture capital investments and when there

17:11.560 --> 17:17.000
is money in the VC market, the vendors are rushing to get TPC audits and try to one at each other

17:17.000 --> 17:22.680
in terms of performance in the hope of getting more venture capital. So that's kind of an interesting

17:22.680 --> 17:30.280
observation, but of course it's not the full picture. All right, so we arrived to LDBC. What is LDBC?

17:30.280 --> 17:37.720
Well, it is a non-profit founded in 2013, registered in the UK and its goal is to accelerate

17:37.720 --> 17:43.400
progress in graph data management by facilitating this sort of pro-competitive work, where you

17:43.400 --> 17:49.400
collaborate on standards and then try to be the best one at implementing that standard. The name

17:49.400 --> 17:57.240
LDBC reflects this focus on RDF processing back in 2013: it's the Linked Data Benchmark Council.

17:57.800 --> 18:04.920
Back in 2011, when the idea of LDBC came up, it was very similar to the 1980s of relational

18:04.920 --> 18:10.520
databases. Graph databases had major performance issues, there was also no standard graph

18:10.520 --> 18:16.440
query language, and there was just no common understanding of what a graph database system should do.

18:16.440 --> 18:22.600
So some vendors and academics got together and created this organization. Today

18:23.400 --> 18:28.520
we have 18 members, and you can see that it is again dominated by companies based in the United States

18:28.520 --> 18:34.840
and China and we also have cloud vendors, we have software vendors database vendors and consultancy

18:34.840 --> 18:43.960
companies. The LDBC is heavily inspired by the TPC, so we do complex application-level database benchmarks,

18:43.960 --> 18:50.120
we define data generators that can produce different scale factors up to tens of

18:50.120 --> 18:59.560
terabytes, and we even design our benchmarks based on the TPC-H choke point

18:59.560 --> 19:05.720
methodology: we took the TPC-H choke points, we extended them with graph-specific choke points

19:05.720 --> 19:10.600
and then we made the benchmarks such that they cover these choke points. We have a stringent auditing

19:10.600 --> 19:18.440
process and we have adopted the TPC pricing specification. Let's take a look at our most popular

19:18.600 --> 19:26.600
benchmark, the Social Network Benchmark. This uses a data set that captures the Facebook social network

19:26.600 --> 19:35.960
in terms of distribution, specifically the number of edges a given person has. It's actually

19:35.960 --> 19:43.720
based on a 2011 paper. Of course we not only have people in this network, we also have the content

19:43.800 --> 19:49.480
that they create, this is in the form of messages and this forms this nice graph where the left

19:49.480 --> 19:55.560
side is a network and the right side is a bunch of trees. So what can we do with this? A very

19:55.960 --> 20:02.120
simplified example query is this: we have two parameters, a name and a day, and what we're

20:02.120 --> 20:07.480
interested in is that starting from a person with a given name, if we do two hops along the

20:07.560 --> 20:13.480
knows edges, friends and friends of friends, and then we check the messages that those persons

20:13.480 --> 20:19.240
authored. What are the messages that were created before a given day? So let's go through this with

20:19.240 --> 20:25.160
an example. Let's say we are interested in Ben and messages before Saturday. In this case we do

20:25.160 --> 20:31.880
one hop along the knows edges to get to Ada and Carl, one more to get to Finn and Eve, and then

20:31.880 --> 20:39.320
we hop over to their messages and do a filtering operation to get the ones that were created

20:39.320 --> 20:46.920
before Saturday. So visually it's very simple but for the database it means efficient lookups,

20:46.920 --> 20:54.040
joins, and filtering, so it's quite an involved operation. And of course updates are also interesting

20:54.040 --> 20:59.400
and in the graph space this is even more so. We can do simple inserts, which is not that special:

20:59.400 --> 21:06.040
we just add a knows edge between Eve and Gia, or we decide that Gia creates a reply to message

21:06.040 --> 21:13.160
M3 on Sunday. This is pretty run-of-the-mill stuff, but where graph databases can be interesting

21:13.160 --> 21:19.400
is when it comes to deep delete operations. So for example if you decide to delete the person Eve,

21:20.200 --> 21:25.960
then that could mean only deleting the node and its outgoing edges, but if you are privacy

21:25.960 --> 21:32.200
conscious and you need to delete all the data that person ever created that also means removing

21:32.200 --> 21:37.240
all the comments that she ever wrote and maybe the replies to those comments and the likes to

21:37.240 --> 21:43.000
those replies. So it can have a kind of rippling effect: we just removed one node and ended up deleting

21:43.000 --> 21:48.200
almost half of our example graph. So deletes in graphs are quite heavy-hitting operations.
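The two graph operations described above, the two-hop read and the deep delete, can be sketched on a toy in-memory graph. The names, dates, and message ids here are invented to echo the example, and a real system would of course use indexes rather than plain dicts.

```python
from datetime import date

knows = {"Ben": ["Ada", "Carl"], "Ada": ["Finn"], "Carl": ["Eve"]}
messages = {  # message id -> (author, creation date)
    "M1": ("Finn", date(2024, 5, 1)),
    "M2": ("Eve", date(2024, 5, 10)),
    "M3": ("Ada", date(2024, 5, 3)),
}
replies = {"M1": ["M3"]}  # message id -> ids of its replies

def two_hop_messages(person, before):
    # Friends and friends-of-friends along the knows edges, then a date filter.
    hop1 = set(knows.get(person, []))
    hop2 = {f for p in hop1 for f in knows.get(p, [])}
    reachable = hop1 | hop2
    return sorted(m for m, (author, d) in messages.items()
                  if author in reachable and d < before)

def delete_person(person):
    # Deep delete: remove the person's messages and the reply trees under them.
    doomed = [m for m, (author, _) in messages.items() if author == person]
    while doomed:
        m = doomed.pop()
        doomed.extend(replies.pop(m, []))
        messages.pop(m, None)
```

Deleting Finn here also removes M3, Ada's reply to Finn's message M1, which is exactly the rippling effect the talk describes.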

21:48.280 --> 21:56.760
How does this all add up into a benchmark workload? Well we have a preprocessing phase where

21:56.760 --> 22:02.520
you can tweak the data to your liking, then you load it into your database, do a 30-minute warm-up

22:02.520 --> 22:08.120
and then a two-hour benchmark run. The goal of having this two and a half hour period is to have

22:08.120 --> 22:13.720
some realistic representation of an actual live system where the database is running for a

22:13.720 --> 22:21.160
prolonged period of time and not just running a bunch of queries. The workload mix is composed of

22:21.160 --> 22:25.400
three different types of operators. We have complex reads like the one I have shown you,

22:25.800 --> 22:33.400
we have short reads like looking up a certain message or getting the friends of another person

22:33.400 --> 22:39.080
and then we have the update operators, the inserts and the deletes. And the way we replay this

22:39.160 --> 22:44.840
is actually quite interesting. So in the generator we simulate a three-year period of the social

22:44.840 --> 22:51.640
network's activity. We write down the timestamps and then we replay them. We try to compress time

22:51.640 --> 22:57.320
based on something called a total compression ratio. So for example we have four operators here,

22:57.320 --> 23:03.400
along some logical simulation time. And then if we have two threads to run the benchmark,

23:03.400 --> 23:11.480
we can compress them by putting updates one and two on thread one and update three on thread two.

23:11.480 --> 23:15.960
It's actually more involved than this because there are all sorts of data dependencies. For example,

23:15.960 --> 23:21.560
you cannot like a message before that message was created. So those have to be tracked by the benchmark

23:21.560 --> 23:26.280
driver. But once you're done with the tracking, you can do the scheduling. And then we schedule

23:26.360 --> 23:36.200
everything else to follow these update operations. For example, we do maybe 10 updates between

23:36.200 --> 23:42.440
each query one, 15 updates between each query two and so on. And then the queries themselves also

23:42.440 --> 23:47.960
send off a kind of short-read follow-up. This is to simulate that you have written a message,

23:47.960 --> 23:52.680
but now you're maybe looking around in the social network to have something more to do.
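The time-compression scheduling described above could be sketched as follows. This is a bare-bones illustration: the dependency tracking the talk mentions, such as a like having to follow its message's insert, is deliberately left out.

```python
def compress_schedule(ops, tcr, num_threads):
    # Divide logical simulation timestamps by the total compression ratio
    # (TCR) to get wall-clock offsets, then deal the operations round-robin
    # onto the driver threads.
    schedule = [[] for _ in range(num_threads)]
    for i, (sim_time, op) in enumerate(sorted(ops)):
        schedule[i % num_threads].append((sim_time / tcr, op))
    return schedule
```

With three updates at simulation times 100, 200, and 300, a TCR of 100, and two threads, updates one and three land on thread one at wall-clock offsets 1.0 and 3.0, and update two lands on thread two at offset 2.0.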

23:53.160 --> 24:01.560
If you start running an actual database and start hitting it with this workload,

24:01.560 --> 24:06.360
what you will notice after some time is that maybe the database is not able to keep up

24:06.360 --> 24:13.080
with your workload. So the start times of the queries will start drifting back in time. And obviously

24:13.080 --> 24:17.480
this is not great. What we want is that the database should be able to keep up. So we have this

24:17.480 --> 24:24.520
on-time requirement that 95% of the executed queries must start within one second of their scheduled

24:24.520 --> 24:32.360
time. Otherwise, it's an invalid run. All right. So we have the data set the queries, the updates,

24:32.360 --> 24:37.640
and the workload mix; how do we audit all that? Well, as I said, we have strict auditing guidelines

24:37.640 --> 24:45.480
similar to the TPC's. We have a list of certified auditors that you can work with. And unlike the TPC,

24:45.480 --> 24:51.400
here the auditors are always rerunning the benchmarks, and they are writing the full disclosure

24:51.400 --> 24:57.800
report with all the details. One sign of success that we have noticed a few years ago is that people

24:57.800 --> 25:03.960
told us that they have seen these RFPs, requests for proposals, for critical internal procurement

25:03.960 --> 25:11.160
in companies that necessitated official LDBC benchmark results. So as a benchmark organization,

25:11.160 --> 25:16.360
you sometimes get to become the arbitrator of whether a system gets picked or whether another system

25:16.360 --> 25:23.560
gets picked. You can see a list of audited results on our web page, and I created a similar

25:23.560 --> 25:31.000
plot to the TPC-H plot. Here is how our transactional workload ran on scale factor 300,

25:31.000 --> 25:37.560
the 300 GB data set, over the last five or six years. What you can see is that it started

25:37.560 --> 25:46.360
from around 4,000 operations per second, and it grew about 30x, to more than 130,000 operations per

25:46.360 --> 25:51.960
second. What you can also see is that these are all vendors based in China. Apparently China has a

25:51.960 --> 26:00.440
very healthy venture capital ecosystem for funding projects like this, and these companies were basically

26:00.440 --> 26:05.000
one-upping each other in terms of performance. So this kind of competition-driving thing

26:05.000 --> 26:12.840
really works with benchmarks. We did quite a few audits, so I have a few lessons to share about this.

26:12.840 --> 26:18.520
One thing that we have noticed is that performance variability is huge, especially in the cloud.

26:18.520 --> 26:25.400
We had issues because of limited disk space, we had issues that you get a different machine,

26:25.400 --> 26:30.280
and the performance of the I/O is wildly different, but we also had issues where it was a

26:30.280 --> 26:36.680
good machine, there was enough space on the disk, but on the second run it just had an

26:36.680 --> 26:42.600
inexplicable performance drop. This is obviously something that the auditor needs to take into account.

26:43.160 --> 26:48.600
We also noticed that sometimes there were misaligned expectations, because the vendors expected the auditor

26:48.600 --> 26:55.720
to finish the implementation, and in general the audits always took much longer than expected,

26:55.720 --> 27:02.360
sometimes three or four months. We tried to mitigate this by having an elaborate questionnaire

27:02.360 --> 27:07.480
of dozens of questions that we go through during the pre-audit process, but this is still a

27:07.480 --> 27:14.920
difficult challenge to get the audits done on time. A few words about organization, so obviously

27:14.920 --> 27:19.720
running such an operation really needs some sort of an organizational structure, some sort of

27:19.720 --> 27:26.920
financing, and so on. The LDBC has had four distinct phases in its lifespan. It was originally

27:26.920 --> 27:31.960
an EU project with almost three million euros of funding. Then it coasted for a few years,

27:31.960 --> 27:37.720
mostly living off of favors from vendors and academic institutes. Then we actually started collecting

27:37.720 --> 27:45.480
money from our members, maybe $2,000 from regular members, so we had small fees, and from last

27:45.480 --> 27:53.080
year we upped this, and now we have a 150,000-euro-per-year income. This looks nice, but it

27:53.080 --> 28:00.200
actually has quite a few problems, and the problem is banking. We were kind of doing the rounds at different

28:00.200 --> 28:05.320
banks, and ended up settling on one of these new FinTech neobanks that you use from your phone.

28:05.880 --> 28:12.280
Neobank 1 kicked us out without explanation. We were moving our funds to Neobank 2,

28:12.840 --> 28:18.280
and they blocked our account, and they triggered an urgent and extensive KYC,

28:18.280 --> 28:23.640
know-your-customer check, because they said we have an unusual structure, or they didn't say that,

28:23.640 --> 28:31.320
but the reason was clearly this. What this means is that they really don't like organizations

28:31.320 --> 28:39.320
that have 25 directors from more than 10 different companies. We passed this check, but then we did

28:39.320 --> 28:45.000
the restructuring of the organization, and we were happy for a few years, until they said,

28:45.000 --> 28:51.400
we cannot accept money from your member X. In the last couple of weeks, I tried to migrate to Neobank

28:51.400 --> 28:57.400
3, they denied our application straight away, so you can see it's kind of difficult to do. Maybe

28:57.400 --> 29:02.600
we will go to London into a traditional brick and mortar bank and open an actual account,

29:02.600 --> 29:05.960
with an actual person instead of fighting support. That is to be seen.

29:06.840 --> 29:12.680
So it is a difficult thing to run a benchmark organization, and it's also difficult to have a value

29:12.680 --> 29:19.560
proposition. We found that vendors join us to have a seat at the table when the benchmarks are designed,

29:19.560 --> 29:24.840
and one particularly interesting reason is the defensive reason they want to prevent the benchmarks

29:24.840 --> 29:31.000
from being used against them. So they are there to prevent a new benchmark being defined that puts them

29:31.000 --> 29:37.320
at a competitive disadvantage. Interestingly, many of our vendors say, yes, benchmarks are nice,

29:37.320 --> 29:42.600
they have maybe achieved their goals, but we should now focus more on query languages,

29:42.600 --> 29:48.200
graph schema, and de-emphasize the benchmarking part, and just keep some data sets and synthetic data

29:48.200 --> 29:54.920
generators. To reflect this, LDBC has actually rebranded as GDC, so we are now called the

29:54.920 --> 30:01.640
Graph Data Council, and we still do benchmarks, but it's more of an overarching theme on graphs

30:01.640 --> 30:07.880
instead of just a benchmark organization. Alright, popular benchmarks. I have two popular benchmarks

30:07.880 --> 30:13.960
to talk about. One is ClickBench. This is a macro-benchmark for analytical

30:13.960 --> 30:20.600
database systems. This is designed and maintained by ClickHouse, and it is designed

30:20.600 --> 30:26.760
to be an easy-to-use benchmark for analytical workloads. The data set has a single wide table of about

30:26.760 --> 30:35.480
100 million rows and 105 attributes, so 10 billion plus cells in the table. It's 75 gigabytes in CSV,

30:35.480 --> 30:43.240
and it's web analytics data coming from the Yandex Metrica data set. The queries focus on scan operations,

30:43.240 --> 30:47.480
aggregations, and lookups. You can see a bunch of queries starting from the simple

30:47.480 --> 30:54.520
count-star through some filtering, through more group-by-aggregate queries. ClickBench's

30:54.520 --> 30:59.240
philosophy, as much as I can tell, is to strive for simplicity, so there are no scale factors,

30:59.240 --> 31:05.960
no query parameters, only simple queries, and very few restrictions on the implementation. The whole

31:05.960 --> 31:12.920
benchmark is kind of fast-paced, and you should be able to do a git clone, go to your directory,

31:12.920 --> 31:17.240
do the benchmark, and finish within 20 minutes, including downloading the data set.
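To give a feel for those query shapes, here is a tiny illustrative sketch, in the spirit of the benchmark rather than its verbatim queries, run with Python's sqlite3 over a miniature stand-in for the hits table:

```python
import sqlite3

# Miniature stand-in for the ClickBench "hits" table (the real one has ~100M rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hits (UserID INTEGER, SearchPhrase TEXT, AdvEngineID INTEGER)")
con.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [(1, "", 0), (1, "cats", 2), (2, "dogs", 0), (3, "cats", 0)],
)

# Simple full scan: count all rows.
total = con.execute("SELECT COUNT(*) FROM hits").fetchone()[0]

# Filtered aggregation: rows matching a predicate.
with_ads = con.execute("SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0").fetchone()[0]

# Group-by aggregate: most frequent non-empty search phrase.
top = con.execute(
    "SELECT SearchPhrase, COUNT(*) AS c FROM hits "
    "WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 1"
).fetchone()

print(total, with_ads, top)  # 4 1 ('cats', 2)
```

The actual benchmark runs dozens of queries along this spectrum, from trivial scans to heavier group-by-order-by queries, against each system under test.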

31:17.480 --> 31:21.000
Then you can open a pull request, and they usually review it within a few days.

31:22.920 --> 31:28.120
This allowed ClickBench to scale up in terms of the results quite quickly. They have more than 75

31:28.120 --> 31:35.400
different systems, with 125 different configurations, and I think this is a massive success in terms

31:35.400 --> 31:42.040
of stimulating competition and just letting people get their results out there. They have somewhat

31:42.040 --> 31:46.920
standardized hardware, so there are a few common setups that are at the center of the competition,

31:46.920 --> 31:52.760
but you're welcome to bring your own setup maybe on premises, laptop, GPU, you name it.

31:54.120 --> 31:59.480
If you open ClickBench, they have this elaborate leaderboard, and if you zoom in, it allows you

31:59.480 --> 32:04.040
to compare systems. I picked MySQL and Postgres, because neither of them has a reasonable

32:04.040 --> 32:10.840
chance of being good on a fully analytical benchmark, and you can see that you can compare their

32:10.920 --> 32:16.280
performance, you can compare their system sizes, the database size, that is, the size after

32:16.280 --> 32:23.720
loading, the load times, the individual query times, and so on. Another benchmark that is reasonably

32:23.720 --> 32:30.680
popular is called H2O.ai. This was originally created by Jan Gorecki, and funded by H2O.ai.

32:30.680 --> 32:35.720
Then it was abandoned, and now a fork is maintained by DuckDB Labs, where I work during my

32:36.680 --> 32:42.200
day job. It has two workloads, focusing on group-bys and joins, and three scale factors: 0.5,

32:42.200 --> 32:47.000
5, and 50 gigabytes. You may have seen it floating around on social media; it has these really

32:47.000 --> 32:53.240
recognizable colorful squares, and these long bars representing the first runtime for the cold

32:53.240 --> 33:00.280
run, and the second runtime for the warm run. It uses a synthetic data generator, and the workloads

33:00.280 --> 33:07.560
themselves are really simple. So, five basic groupings, five advanced groupings, switching between

33:07.560 --> 33:12.920
low and high cardinality, grouping on different types of strings, and so on. And there's a join

33:12.920 --> 33:19.400
workload with five different join queries. We are the maintainers of this benchmark, and we keep

33:19.400 --> 33:25.400
getting proposals for extending it with different functionality like window functions. This is actually

33:25.400 --> 33:30.600
becoming a dilemma for us. What do we do with it? If we keep it as it is, then it's slowly going to

33:30.600 --> 33:36.600
become a stale benchmark. If we extend it, then people will say, well, you just accept the proposals

33:36.600 --> 33:42.440
that make your system look good. Again, you know, an independent benchmark authority would be nice,

33:42.440 --> 33:48.760
it's kind of difficult to maintain a benchmark as a vendor. So, to kind of round it out,

33:49.320 --> 33:55.000
the takeaways. I think benchmarks do carry tremendous value, they capture a common understanding

33:55.000 --> 34:01.480
of your field, they drive innovation, but they are difficult to finance, and even though vendors

34:01.480 --> 34:06.440
use them extensively, a lot of that is internal use, and it's difficult to get funding for it.

34:07.400 --> 34:12.280
Here's my idea of benchmarking in one slide. If you ever do it, maybe it's worth revisiting,

34:12.280 --> 34:17.320
and I would just like to leave you with the thought that you maybe shouldn't build your own

34:17.320 --> 34:23.240
benchmark organization from scratch. LDBC, or GDC, is open; please reach out if you would like to learn

34:23.240 --> 34:27.320
more. Thank you.

