WEBVTT

00:00.000 --> 00:07.000
So, all right, there we go.

00:07.000 --> 00:10.000
Hey everybody, thank you very much.

00:10.000 --> 00:13.000
Wow, it's loud.

00:13.000 --> 00:15.000
Free applause.

00:15.000 --> 00:16.000
I like it.

00:16.000 --> 00:18.000
Hey everybody, good.

00:18.000 --> 00:23.000
So, I don't know if anyone recognizes what movie this is from.

00:23.000 --> 00:28.000
So, Dog Day Afternoon, because hostage situation, because of the title of the talk.

00:28.000 --> 00:29.000
Bad pun.

00:29.000 --> 00:30.000
I know.

00:30.000 --> 00:31.000
Anyways.

00:31.000 --> 00:47.000
So, in 2017 at VLDB, Hannes Mühleisen and Mark Raasveldt gave a talk and had this paper called,

00:47.000 --> 00:49.000
Don't Hold My Data Hostage.

00:49.000 --> 00:57.000
Specifically talking about the details and problems with the formats that are used for query,

00:58.000 --> 00:59.000
retrieval.

00:59.000 --> 01:03.000
And this was a big deal when they gave it.

01:03.000 --> 01:09.000
And lots of data systems are adopting what they learned, which I'm going to go over.

01:09.000 --> 01:12.000
But we still need more.

01:12.000 --> 01:23.000
And so, a lot of what the paper is about is that databases constantly have their own client server protocols.

01:24.000 --> 01:31.000
And there are a few specific points that the paper makes.

01:31.000 --> 01:40.000
That if you benchmark query result retrieval in the various protocols of major data systems, they're slow.

01:40.000 --> 01:49.000
And in fact, while we focused so much over the years on query execution and planning and optimization,

01:49.000 --> 01:54.000
the actual protocols used for query retrieval for results have kind of stagnated.

01:54.000 --> 01:58.000
And end up being the bottleneck very frequently.

01:58.000 --> 02:07.000
And so, they proposed a novel client protocol that solved the problems they were observing.

02:07.000 --> 02:13.000
And then they benchmarked that against a bunch of the major data systems.

02:13.000 --> 02:20.000
And the central observation is that query result retrieval is expensive.

02:20.000 --> 02:29.000
And largely, it's expensive because of the overhead of serializing and deserializing those results,

02:29.000 --> 02:30.000
the data.

02:30.000 --> 02:36.000
As you're passing it from the server, the database to whatever client, you're serializing it over there,

02:36.000 --> 02:39.000
deserializing it in the client, over and over again.

02:39.000 --> 02:47.000
And that overhead becomes the bottleneck and makes it very expensive.

02:47.000 --> 02:53.000
And so, they tested a few different things.

02:53.000 --> 02:59.000
And they examined data in row oriented formats and column oriented formats,

03:00.000 --> 03:07.000
and row oriented serializations cause a number of different inefficiencies.

03:07.000 --> 03:14.000
If you compare row oriented formats to columnar formats, like, we use column oriented formats internally in databases,

03:14.000 --> 03:19.000
constantly for vectorization purposes, SIMD, and all the benefits therein.

03:19.000 --> 03:23.000
But when you're talking about serializing the data over the wire,

03:23.000 --> 03:28.000
everyone still constantly ends up using row oriented serialization formats,

03:28.000 --> 03:33.000
even though they don't compress as well as a column oriented format.

03:33.000 --> 03:43.000
And more importantly, both the source system and the client system are increasingly column oriented internally,

03:43.000 --> 03:48.000
which means that because you're using a row oriented transport,

03:48.000 --> 03:54.000
the source system has to transpose all that data into rows to pass across the network.

03:54.000 --> 04:00.000
And then your client transposes it all back into columns, over and over again.

04:00.000 --> 04:06.000
Instead of just leaving it in the column oriented format the whole way through,

04:06.000 --> 04:09.000
because that would make more sense, wouldn't it?
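
NOTE
Editor's illustration: a minimal sketch of the round trip described above, using plain Python lists rather than a real database or wire protocol. The sample table and the helper names to_rows and to_columns are made up for illustration.
```python
# Hypothetical columnar table, as a column oriented engine might hold it internally.
columns = {"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]}
def to_rows(cols):
    """Server side: transpose columns into row tuples for a row oriented wire format."""
    return list(zip(*cols.values()))
def to_columns(names, rows):
    """Client side: transpose the received rows back into columns."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}
rows = to_rows(columns)
round_trip = to_columns(list(columns), rows)
print(rows)
print(round_trip == columns)
```
Both transposes touch every value only to end up with the layout the data started in, which is the redundant work a column oriented transport would avoid.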

04:09.000 --> 04:17.000
And so they looked at their novel protocol that was column oriented.

04:17.000 --> 04:22.000
And then they also tested it against various bandwidth conditions,

04:22.000 --> 04:28.000
because it's not only about the size of the data payloads.

04:28.000 --> 04:32.000
It's also about how it deals with the various bandwidth conditions.

04:32.000 --> 04:45.000
And what they found was heavyweight compression only gives you a speed-up when the connection between your client and server is very slow.

04:45.000 --> 04:50.000
If the connection is actually very fast,

04:50.000 --> 05:01.000
the overhead of performing the compression is more expensive than any slowdown from the payload being large.

05:01.000 --> 05:13.000
Which means that lightweight compression is much more effective at speeding up retrieval at the network speeds that were common back in 2017,

05:13.000 --> 05:18.000
which have only gotten faster in the almost-decade since,

05:18.000 --> 05:27.000
which means that if your connection is super fast, and they tested it specifically on a local loopback connection,

05:27.000 --> 05:31.000
your best bet is just don't even bother with the compression.

05:31.000 --> 05:42.000
Because if the network's fast enough, any actual cost of performing the compression and decompression is more expensive than the transport,

05:42.000 --> 05:47.000
transport cost.
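
NOTE
Editor's illustration: a rough cost model for the trade-off described above, where total time is codec work plus time on the wire. Every number here (payload size, 4x ratio, 50 MB/s codec throughput, link speeds) is an assumption chosen for illustration and is not taken from the paper; transfer_seconds is a hypothetical helper.
```python
def transfer_seconds(payload_bytes, bandwidth_bytes_per_s, ratio=1.0, codec_bytes_per_s=None):
    """Estimated time to ship a result set: optional codec work plus time on the wire."""
    codec_s = 0.0 if codec_bytes_per_s is None else payload_bytes / codec_bytes_per_s
    wire_s = (payload_bytes / ratio) / bandwidth_bytes_per_s
    return codec_s + wire_s
PAYLOAD = 1_000_000_000  # assumed 1 GB result set
HEAVY = dict(ratio=4.0, codec_bytes_per_s=50_000_000)  # assumed heavyweight codec: 4x smaller, 50 MB/s
for label, bandwidth in [("slow link, 10 MB/s", 10_000_000), ("fast link, 1 GB/s", 1_000_000_000)]:
    plain = transfer_seconds(PAYLOAD, bandwidth)
    heavy = transfer_seconds(PAYLOAD, bandwidth, **HEAVY)
    print(f"{label}: uncompressed {plain:.1f}s, heavyweight {heavy:.1f}s")
```
Under these assumed numbers the heavyweight codec wins on the slow link and loses badly on the fast one, because the codec time stays fixed while the wire time shrinks.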

05:47.000 --> 05:53.000
So of course their novel protocol made compression optional.

05:53.000 --> 05:58.000
And so when you compare the various timings in the benchmark,

05:58.000 --> 06:04.000
their protocol against the common ones of the day.

06:04.000 --> 06:09.000
And then you can see in bold the best performing scenario here,

06:09.000 --> 06:16.000
which was their protocol in MonetDB, because they would go on to make DuckDB anyways.

06:16.000 --> 06:26.000
So the novel protocol was faster than any of the database protocols that existed at the time.

06:26.000 --> 06:36.000
Because of all of these benefits that you get from a column oriented client server protocol where compression is optional,

06:36.000 --> 06:42.000
where you enable all of these things on arbitrarily fast connections,

06:42.000 --> 06:51.000
and even at slow connections it's still more efficient to do it this way.

06:51.000 --> 06:58.000
And so they wrap up the paper emphasizing that these row-based protocols,

06:58.000 --> 07:01.000
which were originally designed in a very different era,

07:01.000 --> 07:06.000
are just not suitable for modern data analytic needs,

07:06.000 --> 07:10.000
and they're increasingly a poor decision.

07:10.000 --> 07:15.000
And yet we're still using them nearly a decade later.

07:15.000 --> 07:18.000
So that was in 2017.

07:18.000 --> 07:21.000
In 2016 just before that,

07:21.000 --> 07:26.000
Apache Arrow became a top level project of the Apache Software Foundation.

07:27.000 --> 07:30.000
Arrow was brand new.

07:30.000 --> 07:37.000
We don't know if Hannes and Mark knew about Arrow at the time they wrote the paper,

07:37.000 --> 07:44.000
but even if they did, Arrow was so new that it was unclear whether or not Arrow actually offered the solutions.

07:44.000 --> 07:49.000
But what they described, their novel protocol,

07:49.000 --> 07:56.000
was for all intents and purposes, Arrow pretty much.

07:56.000 --> 07:58.000
There's a few minor differences here and there,

07:58.000 --> 08:01.000
but for all intents and purposes,

08:01.000 --> 08:08.000
a compression optional, batched, column oriented format for data transportation,

08:08.000 --> 08:12.000
communication, and for data computation,

08:12.000 --> 08:17.000
that has almost no serialization overhead.

08:18.000 --> 08:21.000
The bytes on the wire are the bytes in memory,

08:21.000 --> 08:23.000
small little FlatBuffers header,

08:23.000 --> 08:26.000
but for the most part, the bytes are as they are,

08:26.000 --> 08:29.000
compression is optional, and so on.
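
NOTE
Editor's illustration: a standard-library sketch of the "bytes on the wire are the bytes in memory" idea. It is not Arrow's actual IPC format (there is no metadata header, schema, or validity bitmap here), and it assumes both ends share the same byte order; a single int64 column stands in for a columnar buffer.
```python
import json
from array import array
values = array("q", range(5))  # one fixed-width int64 column (stand-in for a columnar buffer)
# Text path: every value is formatted on one side and parsed again on the other.
json_wire = json.dumps(list(values)).encode()
from_json = json.loads(json_wire)
# Buffer path: the bytes sent are the bytes already in memory.
raw_wire = values.tobytes()
view = memoryview(raw_wire).cast("q")  # reinterpret in place, no per-value parsing
print(from_json, list(view), len(raw_wire))
```
The receiver of the raw buffer only reinterprets the bytes it was handed, while the JSON receiver has to parse every value back into a number.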

08:29.000 --> 08:35.000
And so in the earliest days, Arrow was most widely used

08:35.000 --> 08:41.000
as that kind of serialization format where you don't need to serialize.

08:42.000 --> 08:47.000
For transporting data between different systems and across networks,

08:47.000 --> 08:49.000
it started low level,

08:49.000 --> 08:54.000
and gradually you went from the format to the C ABI

08:54.000 --> 08:56.000
for interoperability,

08:56.000 --> 09:02.000
to Flight RPC, a framework for client server query protocols,

09:02.000 --> 09:07.000
to Flight SQL for the full protocol for database interactions,

09:08.000 --> 09:11.000
to ADBC, which is more recent,

09:11.000 --> 09:15.000
which is basically ODBC, but better.

09:15.000 --> 09:19.000
You know, similar idea, you have a single client API,

09:19.000 --> 09:22.000
and swappable drivers implemented,

09:22.000 --> 09:25.000
but it's all using Arrow as the format for the data,

09:25.000 --> 09:29.000
so you get column oriented data the whole way through.

09:29.000 --> 09:32.000
So when you're talking about dealing with databases here,

09:32.000 --> 09:36.000
your client server protocol can keep the data in

09:36.000 --> 09:39.000
the column oriented format, the whole way through,

09:39.000 --> 09:44.000
especially when your source system is column oriented internally.

09:44.000 --> 09:47.000
And then gradually, you know,

09:47.000 --> 09:53.000
now it's much clearer and better known that Arrow is what it is.

09:53.000 --> 09:57.000
I saw several people clap when they saw it in the picture up there

09:57.000 --> 10:00.000
of the Arrow logo, and so on, many people know what it is,

10:00.000 --> 10:03.000
but also many people don't know what Arrow is,

10:03.000 --> 10:05.000
because it's such a low level technology.

10:05.000 --> 10:09.000
Arrow is in almost everything already.

10:09.000 --> 10:13.000
If you're using Polars, Polars is Arrow under the hood.

10:13.000 --> 10:17.000
If you're using pandas, it's basically Arrow under the hood,

10:17.000 --> 10:20.000
especially with the most recent release.

10:20.000 --> 10:23.000
DuckDB's internal memory format is,

10:23.000 --> 10:26.000
for all intents and purposes, essentially Arrow,

10:26.000 --> 10:29.000
and so on and so on.

10:29.000 --> 10:34.000
And last year, me and my co-founders at Columnar put out

10:34.000 --> 10:41.000
a blog post explaining and punctuating a lot of what I've said

10:41.000 --> 10:43.000
at the beginning of the talk.

10:43.000 --> 10:47.000
And if you're interested, go read the blog post,

10:47.000 --> 10:49.000
lots of visuals, lots of cool things,

10:49.000 --> 10:52.000
talking about modern day protocols,

10:52.000 --> 10:56.000
specifically talking about Postgres in that article.

10:56.000 --> 10:59.000
So if you want a more in-depth take here,

10:59.000 --> 11:01.000
go ahead and read.

11:01.000 --> 11:08.000
The other part is that the conclusions

11:08.000 --> 11:12.000
that they reached have only gotten more important

11:12.000 --> 11:16.000
as your average network speeds have gone up.

11:16.000 --> 11:21.000
In 2017, the average global network speed for broadband

11:21.000 --> 11:23.000
was about 39 megabits.

11:23.000 --> 11:27.000
In 2023, your average global speed was 110,

11:27.000 --> 11:29.000
and it's only gotten bigger.

11:29.000 --> 11:34.000
And so the faster your network speed is,

11:34.000 --> 11:38.000
the more you need a compression optional, batched, column oriented

11:38.000 --> 11:42.000
format to take advantage of that speed.

11:42.000 --> 11:45.000
You're just throwing performance away

11:45.000 --> 11:50.000
by converting everything into JSON and rows and so on.

11:50.000 --> 11:53.000
And so this has been a real triumph for Arrow.

11:53.000 --> 11:56.000
Because Arrow is a format, it's a spec.

11:56.000 --> 12:02.000
It has libraries in pretty much every language under the sun.

12:02.000 --> 12:05.000
If you are using a programming language,

12:05.000 --> 12:08.000
there is probably an Arrow implementation for it.

12:08.000 --> 12:10.000
But because the format is identical,

12:10.000 --> 12:13.000
no matter what language you're using,

12:13.000 --> 12:15.000
everything's all interoperable.

12:15.000 --> 12:18.000
And many systems have adopted it.

12:18.000 --> 12:22.000
You can see here at mean.tv is completely interoperable with Arrow.

12:22.000 --> 12:27.000
Dremio, Google BigQuery, you can read from the Storage Read API

12:27.000 --> 12:29.000
and return Arrow.

12:29.000 --> 12:33.000
ClickHouse, Snowflake, all return Arrow, and so on.

12:33.000 --> 12:37.000
In fact, half the time they do that because their clients

12:37.000 --> 12:40.000
are demanding it, because it's immediately interoperable with

12:40.000 --> 12:43.000
all the downstream Polars, pandas,

12:44.000 --> 12:51.000
various front end visualization systems

12:51.000 --> 12:54.000
like Perspective and so on.

12:54.000 --> 12:57.000
But at the same time, there's lots of systems

12:57.000 --> 12:59.000
that aren't adopting it.

12:59.000 --> 13:02.000
That are still row oriented.

13:02.000 --> 13:07.000
Still row oriented protocols that are only returning JSON,

13:07.000 --> 13:08.000
even though they're client.

13:08.000 --> 13:12.000
Even though they're column oriented internally and so on.

13:12.000 --> 13:16.000
And so you think about, okay, why?

13:16.000 --> 13:18.000
Is it Stockholm syndrome?

13:18.000 --> 13:20.000
Why?

13:20.000 --> 13:26.000
And what it comes down to is it's a really high switching cost.

13:26.000 --> 13:28.000
It's a steep learning curve.

13:28.000 --> 13:30.000
If you're writing new code,

13:30.000 --> 13:34.000
then you have to reskill and learn how to think about things

13:34.000 --> 13:35.000
in this way.

13:35.000 --> 13:37.000
If you're adding to an existing system,

13:37.000 --> 13:41.000
it requires migrating your system to this new way of doing things.

13:41.000 --> 13:48.000
So it's not an easy thing to convert from

13:48.000 --> 13:52.000
our usual understanding of row oriented stuff

13:52.000 --> 13:55.000
to utilizing Arrow and column oriented stuff.

13:55.000 --> 13:58.000
If that's not what you're already familiar with.

13:58.000 --> 14:02.000
And to kind of help visualize this:

14:02.000 --> 14:08.000
Anyone familiar with what this keyboard layout is?

14:08.000 --> 14:11.000
It's Dvorak.

14:11.000 --> 14:15.000
You might be more familiar with QWERTY.

14:15.000 --> 14:20.000
And then there's also a lovely other format

14:20.000 --> 14:23.000
that is less known, called Colemak.

14:23.000 --> 14:26.000
And so think about it this way.

14:26.000 --> 14:29.000
You're used to QWERTY or you're used to Dvorak, whatever.

14:29.000 --> 14:34.000
And someone puts one of the other keyboard layouts in front of you

14:34.000 --> 14:37.000
that is very, very different.

14:37.000 --> 14:42.000
It is not easy to rewire your brain

14:42.000 --> 14:46.000
after spending decades typing on a QWERTY keyboard

14:46.000 --> 14:49.000
to suddenly use Dvorak.

14:49.000 --> 14:53.000
And that's kind of the same system we're talking about here

14:53.000 --> 14:56.000
with the difficulty of the conversion

14:56.000 --> 15:00.000
from these row oriented paradigms to column oriented ones.

15:00.000 --> 15:04.000
And also consider the fact that

15:05.000 --> 15:12.000
the experts on all this stuff always overestimate

15:12.000 --> 15:18.000
the actual understanding of the non-experts.

15:18.000 --> 15:24.000
No matter how much you think that you're accounting for

15:24.000 --> 15:27.000
what a non-expert might know.

15:27.000 --> 15:31.000
If you're an expert in the system, you're probably overestimating

15:31.000 --> 15:34.000
what their understanding is.

15:34.000 --> 15:39.000
And so one of the things that the Arrow community and that

15:39.000 --> 15:43.000
Columnar in general are trying to do is we want to

15:43.000 --> 15:46.000
shrink this learning curve.

15:46.000 --> 15:50.000
We want to make this feel much more gentle and make it easier.

15:50.000 --> 15:52.000
And then you think about, you know,

15:52.000 --> 15:54.000
since we're talking about data systems.

15:54.000 --> 15:56.000
And we talk about, you know,

15:56.000 --> 15:59.000
the problems of ODBC and JDBC that were designed,

15:59.000 --> 16:03.000
you know, 20 years ago, 30 years ago and everybody

16:03.000 --> 16:06.000
hates them and they're awful.

16:06.000 --> 16:11.000
And I mentioned Arrow has ADBC for communication of data

16:11.000 --> 16:14.000
stuff that way, of dealing with databases.

16:14.000 --> 16:19.000
And so we have an example here of simplifying this data

16:19.000 --> 16:23.000
transfer with Arrow under the hood.

16:23.000 --> 16:27.000
And you can use this little CLI that Columnar made open source

16:27.000 --> 16:32.000
and you can just install the driver for the system of your choice

16:32.000 --> 16:33.000
that we have.

16:33.000 --> 16:38.000
We've got about 10, 12, all open source drivers and a couple

16:38.000 --> 16:40.000
of non-open ones.

16:40.000 --> 16:42.000
And then you can just, in this case,

16:42.000 --> 16:45.000
I'm using Python, but we have driver

16:45.000 --> 16:51.000
managers for Go, C++, R, Rust, whatever.

16:51.000 --> 16:55.000
And then you just load the driver and it just works.

16:56.000 --> 16:58.000
And you get Arrow data out of it,

16:58.000 --> 17:00.000
no matter what the source system is.

17:00.000 --> 17:04.000
But if that source system can support

17:04.000 --> 17:07.000
outputting column oriented data,

17:07.000 --> 17:10.000
you get all of the performance benefits.

17:10.000 --> 17:14.000
You get this significantly faster communication,

17:14.000 --> 17:17.000
query result retrieval, interoperability,

17:17.000 --> 17:20.000
connection with your existing tooling.

17:20.000 --> 17:23.000
If you look at, you know, last year, Power BI

17:24.000 --> 17:27.000
from Microsoft had their new Snowflake connector.

17:27.000 --> 17:29.000
It's just the ADBC driver.

17:29.000 --> 17:31.000
A week and a half or two weeks ago,

17:31.000 --> 17:33.000
they announced Power BI using a new

17:33.000 --> 17:35.000
Databricks connector.

17:35.000 --> 17:38.000
It's just the ADBC driver.

17:38.000 --> 17:41.000
Because Arrow is eating the world,

17:41.000 --> 17:44.000
but it's such a low-level technology that

17:44.000 --> 17:46.000
you don't even realize you're using it

17:46.000 --> 17:49.000
if you're not the one building these systems.

17:49.000 --> 17:52.000
But you have to know about it,

17:52.000 --> 17:56.000
pester whatever systems you're using

17:56.000 --> 17:58.000
to add support for it,

17:58.000 --> 18:01.000
so that you can then benefit.

18:01.000 --> 18:03.000
And so, you know,

18:03.000 --> 18:07.000
check out our little CLI for managing drivers.

18:07.000 --> 18:09.000
Check, learn about it.

18:09.000 --> 18:12.000
Learn about, you know, ADBC and Arrow.

18:12.000 --> 18:14.000
Lots of documentation.

18:14.000 --> 18:17.000
Reach out to Columnar, reach out to me.

18:17.000 --> 18:18.000
Reach out to the Arrow community.

18:18.000 --> 18:22.000
You know, whatever level of participation you want to do.

18:22.000 --> 18:26.000
But the important point here is understanding,

18:26.000 --> 18:32.000
let's stop using row-based formats for query retrieval.

18:32.000 --> 18:35.000
Let's stop making everything JSON,

18:35.000 --> 18:37.000
even when it doesn't need to be.

18:37.000 --> 18:39.000
Because yeah, it's human-readable and easy,

18:39.000 --> 18:42.000
but that has a cost.

18:42.000 --> 18:43.000
And that's it.

18:43.000 --> 18:45.000
So, thanks everybody.

18:46.000 --> 18:53.000
I think I've got like two minutes for questions.

19:13.000 --> 19:16.000
So, the question was,

19:16.000 --> 19:18.000
if we're sending things out of the data system,

19:18.000 --> 19:21.000
column oriented, to dump it into a web app,

19:21.000 --> 19:24.000
is there anything on the web app side that takes

19:24.000 --> 19:27.000
advantage of the column oriented system?

19:27.000 --> 19:32.000
And the answer to that is, yeah, Arrow.

19:32.000 --> 19:36.000
There is the JavaScript Arrow library.

19:36.000 --> 19:40.000
There are several visualization libraries,

19:40.000 --> 19:44.000
Perspective, Falcon,

19:44.000 --> 19:47.000
which exist, which leverage Arrow data

19:47.000 --> 19:50.000
for their front end visualizations.

19:50.000 --> 19:53.000
And more on that coming soon,

19:53.000 --> 19:56.000
Columnar, the company I'm a co-founder of,

19:56.000 --> 20:01.000
we are working on a Wasm

20:01.000 --> 20:05.000
solution for ADBC directly,

20:05.000 --> 20:08.000
but more on that soon.

20:08.000 --> 20:13.000
I can do, look, am I ahead of time?

20:13.000 --> 20:14.000
I'm out of time.

20:14.000 --> 20:16.000
Thank you very much.

20:20.000 --> 20:24.000
Thank you very much.