WEBVTT

00:00.000 --> 00:07.000
Okay.

00:07.000 --> 00:13.000
Hi everyone, my name is Michiel De Backker, and this is my colleague Gassan Mohamed,

00:13.000 --> 00:15.000
we are both from Twintag.

00:15.000 --> 00:18.000
At Twintag, we do what we call connected products.

00:18.000 --> 00:23.000
Basically, we bring data to physical objects in the real world.

00:23.000 --> 00:27.000
As part of that, we do have a need to handle data very flexibly.

00:27.000 --> 00:32.000
We don't see ourselves as a database vendor, more of a database user,

00:32.000 --> 00:35.000
but because of this flexibility, we do some fun stuff,

00:35.000 --> 00:38.000
and that's what we want to present to you today.

00:38.000 --> 00:44.000
So, we already had a very nice overview of what Arrow can do earlier today,

00:44.000 --> 00:45.000
I think from Matthew.

00:45.000 --> 00:47.000
So, I'm not going to go into this in depth,

00:47.000 --> 00:50.000
but there's basically two things to remember.

00:50.000 --> 00:53.000
Arrow is an in-memory format for columnar storage,

00:53.000 --> 00:57.000
and then we have the wire protocol, Arrow Flight,

00:57.000 --> 01:01.000
to put stuff over the wire.
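To make the columnar idea concrete, here is a hand-rolled sketch in Rust; the real `arrow` crate has typed arrays, validity bitmaps, and a proper schema model, so treat this purely as a mental model, not the actual API:

```rust
// A hand-rolled sketch of a columnar batch, just to illustrate the idea
// behind Arrow's in-memory format. The real `arrow` crate has typed arrays,
// validity bitmaps, and a proper schema model.
struct ColumnarBatch {
    names: Vec<String>,     // one entry per column
    columns: Vec<Vec<i64>>, // column-major: each inner Vec is one full column
}

impl ColumnarBatch {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    // Column-major layout makes whole-column operations like this one
    // cache-friendly: the values of a column are contiguous in memory.
    fn sum_column(&self, name: &str) -> Option<i64> {
        let idx = self.names.iter().position(|n| n == name)?;
        Some(self.columns[idx].iter().sum())
    }
}
```

The point of sharing this layout as a standard is that every system on the wire agrees on it, so Flight can move batches without re-encoding.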

01:01.000 --> 01:03.000
So, what do we want to do in our business?

01:03.000 --> 01:07.000
Basically, we have a whole bunch of data being stored in different formats internally,

01:07.000 --> 01:11.000
but what we want to do is we want to give the simplicity to the end user,

01:11.000 --> 01:16.000
basically giving them one single interface to interact with

01:16.000 --> 01:18.000
all of this data.

01:19.000 --> 01:23.000
And that means we need to solve for the challenges in between.

01:23.000 --> 01:28.000
So, mapping all this logical query access to the physical storage,

01:28.000 --> 01:31.000
federating all the queries across that,

01:31.000 --> 01:35.000
and then also the governance for the multi-tenancy and stuff in between.

01:35.000 --> 01:41.000
So, today we want to show you a little bit how we tackle this in our approach.

01:41.000 --> 01:45.000
This is the same picture with a little bit more detailed,

01:45.000 --> 01:49.000
and so on the left you have some of the access patterns.

01:49.000 --> 01:54.000
We have SQL-like query connectivity for our customers as well as APIs,

01:54.000 --> 01:57.000
and connectivity for the common BI tooling,

01:57.000 --> 02:00.000
and you can see here also that this gives two different types of access patterns.

02:00.000 --> 02:05.000
We have the SQL and the BI tooling, which is more OLAP-related,

02:05.000 --> 02:08.000
but then the APIs tend to be more transactional access patterns.

02:08.000 --> 02:12.000
So, we kind of have to solve for both in our stack.

02:13.000 --> 02:16.000
Yeah, you already see a little bit of hint of how we are tackling that.

02:16.000 --> 02:22.000
We decided on using DataFusion as our query engine.

02:22.000 --> 02:25.000
A little bit more on that decision later,

02:25.000 --> 02:30.000
and then downstream you see all these different sources from relational databases

02:30.000 --> 02:32.000
to stuff in object storage,

02:32.000 --> 02:35.000
where we basically take incoming queries,

02:35.000 --> 02:39.000
cut them up, and federate them through those systems.

02:42.000 --> 02:49.000
And, you know, we didn't open the talk with the Apache Arrow slide for nothing.

02:49.000 --> 02:52.000
We originally started using that for doing some data pipelines,

02:52.000 --> 02:55.000
but now that we're doing this query federation,

02:55.000 --> 02:59.000
we're actually slowly trying to adopt arrow at every part of the stack,

02:59.000 --> 03:01.000
all the way from the connectivity,

03:01.000 --> 03:05.000
where we do have a Flight SQL connector, to DataFusion,

03:05.000 --> 03:07.000
which is arrow native basically,

03:07.000 --> 03:10.000
and then also to the downstream databases.

03:10.000 --> 03:19.000
Now including TiDB, for which I have something as well later.

03:19.000 --> 03:23.000
So, data fusion, you also had a nice introduction already in the previous talk.

03:23.000 --> 03:25.000
What did we like about this?

03:25.000 --> 03:27.000
Well, we are a little bit more up in the stack,

03:27.000 --> 03:30.000
you know, designing services for our end customers.

03:30.000 --> 03:33.000
We ourselves are kind of still system designers,

03:33.000 --> 03:35.000
you know, not really database people,

03:35.000 --> 03:38.000
but we like to write services, you know,

03:38.000 --> 03:43.000
that form a whole and are not duct-taped together with a whole bunch of scripts.

03:43.000 --> 03:45.000
So, being able to embed data fusion,

03:45.000 --> 03:49.000
and therefore have that control really aligned well with our goals there.

03:49.000 --> 03:54.000
It's also very well articulated by this slide that Andrew Lamb made in one of his presentations.

03:54.000 --> 03:58.000
You see DataFusion as the LLVM of databases, you know,

03:58.000 --> 04:01.000
really a tool that you can customize and embed in your service

04:01.000 --> 04:03.000
to do what you need.

04:03.000 --> 04:05.000
Here are a couple of examples of that,

04:06.000 --> 04:08.000
as well as what we are doing.

04:08.000 --> 04:10.000
Definitely a big shout-out to Andrew,

04:10.000 --> 04:13.000
because he's one of those great open-source maintainers out there,

04:13.000 --> 04:17.000
what he did fostering the DataFusion community to what it is today.

04:17.000 --> 04:19.000
It's really impressive.

04:21.000 --> 04:24.000
Normally, I don't like to add a lot of code into slides,

04:24.000 --> 04:27.000
but what I wanted to do here is just give you a little bit of a flavor

04:27.000 --> 04:30.000
of what it means to embed something like data fusion into your services.

04:30.000 --> 04:33.000
To kind of make it less scary for people,

04:33.000 --> 04:36.000
we do recommend that you try to do this yourself.

04:36.000 --> 04:38.000
So, the very basics of this is,

04:38.000 --> 04:41.000
okay, how do you teach DataFusion how to find your data,

04:41.000 --> 04:44.000
and that's where the TableProvider trait comes in.

04:44.000 --> 04:47.000
You just describe what the schema of your data source is

04:47.000 --> 04:50.000
and how to talk to it, how to scan records,

04:50.000 --> 04:54.000
and that's how you plug a new data source into the system.
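To give a feel for what that trait asks of you, here is a simplified stand-in sketched in Rust; the real DataFusion `TableProvider` is async and its scan returns an `ExecutionPlan`, so the names and shapes below are illustrative only:

```rust
// A simplified stand-in for DataFusion's TableProvider trait. The real trait
// is async and returns an ExecutionPlan, but the shape of the contract is the
// same: describe your schema, and know how to scan your data.
trait SimpleTableProvider {
    // Which columns exist, in order.
    fn schema(&self) -> Vec<String>;
    // Produce the rows, optionally projected to a subset of columns.
    fn scan(&self, projection: Option<&[usize]>) -> Vec<Vec<String>>;
}

struct InMemoryTable {
    schema: Vec<String>,
    rows: Vec<Vec<String>>,
}

impl SimpleTableProvider for InMemoryTable {
    fn schema(&self) -> Vec<String> {
        self.schema.clone()
    }
    fn scan(&self, projection: Option<&[usize]>) -> Vec<Vec<String>> {
        match projection {
            None => self.rows.clone(),
            Some(idxs) => self
                .rows
                .iter()
                .map(|row| idxs.iter().map(|&i| row[i].clone()).collect())
                .collect(),
        }
    }
}
```

Implement something with this shape for each backend, register it, and the engine can plan queries against it like any other table.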

04:56.000 --> 05:00.000
And then, you know, in our earlier slide,

05:00.000 --> 05:03.000
we need to do a whole bunch of stuff to map between the external world,

05:03.000 --> 05:06.000
the simplicity for our users and the internal data systems,

05:06.000 --> 05:09.000
that includes stuff like remapping the names of things,

05:09.000 --> 05:11.000
you know, what's a table called,

05:11.000 --> 05:14.000
might be different for the end user than what's in storage, for example.

05:14.000 --> 05:16.000
And DataFusion has really nice

05:16.000 --> 05:20.000
analyzer and optimizer rules that you can ship to do things like,

05:20.000 --> 05:22.000
you know, these renamings, remappings, and all kinds of stuff,

05:22.000 --> 05:25.000
very easy to plug into the engine as well.
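A toy version of such a renaming rule might look like this; the `Plan` enum and `remap_tables` function here are hypothetical stand-ins for DataFusion's `LogicalPlan` and an `AnalyzerRule` walking it:

```rust
use std::collections::HashMap;

// A toy name-remapping analyzer rule: user-facing table names are rewritten
// to the physical names used in storage before planning. In DataFusion proper
// this would be an AnalyzerRule rewriting a LogicalPlan tree.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
}

fn remap_tables(plan: Plan, mapping: &HashMap<String, String>) -> Plan {
    match plan {
        // Rewrite scans whose table name has a physical mapping.
        Plan::Scan { table } => Plan::Scan {
            table: mapping.get(&table).cloned().unwrap_or(table),
        },
        // Everything else just recurses into its input.
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(remap_tables(*input, mapping)),
        },
    }
}
```

The nice part is that the user never sees the physical names: the rewrite happens inside the engine, before the plan is optimized and executed.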

05:26.000 --> 05:29.000
And finally, I think at face value,

05:29.000 --> 05:31.000
this might be a little bit underestimated,

05:31.000 --> 05:35.000
but this is just the way that you kind of spawn DataFusion,

05:35.000 --> 05:38.000
how you set up what they call a session state,

05:38.000 --> 05:43.000
basically describing how one DataFusion instance should act,

05:43.000 --> 05:46.000
so what catalog there is, which analyzers to be used,

05:46.000 --> 05:50.000
where to find the data, and this makes it very easy to set that up,

05:50.000 --> 05:53.000
and you can do it dynamically, you know, depending on what request comes in,

05:54.000 --> 05:56.000
you load the configuration for that tenant,

05:56.000 --> 05:59.000
spawn the session state accordingly,

05:59.000 --> 06:01.000
and off you go to querying.

06:01.000 --> 06:03.000
So a little bit underestimated there,

06:03.000 --> 06:07.000
but very powerful for the kind of systems that we build.
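The per-request idea can be sketched like this; `TenantConfig`, `Session`, and `spawn_session` are made-up names standing in for DataFusion's `SessionState` machinery:

```rust
use std::collections::HashMap;

// A sketch of "spawn a session per request": load a tenant's configuration,
// then build a session that only sees that tenant's catalog. The names here
// are illustrative, not DataFusion API.
struct TenantConfig {
    tenant: String,
    tables: Vec<String>,
}

struct Session {
    // User-facing table name -> physical name, scoped to one tenant.
    catalog: HashMap<String, String>,
}

fn spawn_session(cfg: &TenantConfig) -> Session {
    let catalog = cfg
        .tables
        .iter()
        .map(|t| (t.clone(), format!("{}_{}", cfg.tenant, t)))
        .collect();
    Session { catalog }
}
```

Because the session is cheap to construct, you can build a fresh one per incoming request, which is how the multi-tenancy isolation falls out naturally.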

06:07.000 --> 06:09.000
All right, I said earlier in the talk,

06:09.000 --> 06:12.000
that we have kind of a hybrid access pattern,

06:12.000 --> 06:14.000
we have some more analytical stuff,

06:14.000 --> 06:16.000
but also a lot of access patterns are transactional,

06:16.000 --> 06:19.000
and one thing that happens is if you try to federate a query

06:19.000 --> 06:21.000
to a remote system,

06:21.000 --> 06:25.000
is that if you start doing full database reads from transactional systems,

06:25.000 --> 06:27.000
that's often very slow,

06:27.000 --> 06:31.000
what we need is to really get very specific data from the remote,

06:31.000 --> 06:34.000
and so we've contributed this extension to data fusion

06:34.000 --> 06:36.000
that allows you to do just that,

06:36.000 --> 06:38.000
and I'll just from a high level point of view,

06:38.000 --> 06:41.000
explain how it works, so you might have a plan like this,

06:41.000 --> 06:44.000
see on the right the plan, a couple of joins happening there,

06:44.000 --> 06:49.000
and it just so happens that the circled

06:49.000 --> 06:53.000
part of the plan is sitting in one remote database,

06:53.000 --> 06:55.000
so it could be TiDB, for example, in this case,

06:55.000 --> 06:58.000
so what does DataFusion Federation do?

06:58.000 --> 07:00.000
It analyzes the entire plan,

07:00.000 --> 07:02.000
figures out which subplans are actually provided

07:02.000 --> 07:05.000
by one single remote database management system,

07:05.000 --> 07:07.000
it cuts that part out,

07:07.000 --> 07:10.000
and then it replaces it with an opaque node,

07:10.000 --> 07:12.000
and then it will take that subplan,

07:12.000 --> 07:15.000
materialize it back into SQL or any other query language

07:15.000 --> 07:16.000
that the remote supports,

07:16.000 --> 07:18.000
and when the plan is being executed,

07:18.000 --> 07:21.000
it will take that little part

07:21.000 --> 07:23.000
and go query the remote database,

07:23.000 --> 07:27.000
and this allows you to do some parts of the plan remotely

07:27.000 --> 07:30.000
in the database management system that has that data,

07:30.000 --> 07:33.000
so it can do it very efficiently locally as such.
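A heavily simplified sketch of that cut-and-replace pass, using a hypothetical three-variant plan tree rather than DataFusion's real `LogicalPlan`:

```rust
// A toy version of the federation pass: walk the plan, and when a whole
// subtree is served by one remote source, collapse it into an opaque
// "remote" node that will later be unparsed to that remote's query language.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan { source: String, table: String },
    Join { left: Box<Plan>, right: Box<Plan> },
    Remote { source: String, collapsed: usize }, // opaque federated node
}

// The single source serving this whole subtree, if there is one.
fn source_of(plan: &Plan) -> Option<String> {
    match plan {
        Plan::Scan { source, .. } | Plan::Remote { source, .. } => Some(source.clone()),
        Plan::Join { left, right } => {
            let (l, r) = (source_of(left)?, source_of(right)?);
            (l == r).then_some(l)
        }
    }
}

fn count_nodes(plan: &Plan) -> usize {
    match plan {
        Plan::Scan { .. } | Plan::Remote { .. } => 1,
        Plan::Join { left, right } => 1 + count_nodes(left) + count_nodes(right),
    }
}

fn federate(plan: Plan) -> Plan {
    if let Some(source) = source_of(&plan) {
        // Whole subtree lives in one remote: replace it with an opaque node.
        return Plan::Remote { source, collapsed: count_nodes(&plan) };
    }
    match plan {
        Plan::Join { left, right } => Plan::Join {
            left: Box::new(federate(*left)),
            right: Box::new(federate(*right)),
        },
        other => other,
    }
}
```

So a join between two TiDB tables collapses into one opaque node pushed to TiDB, while the cross-database join above it stays in the local engine.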

07:33.000 --> 07:40.000
Yeah, and then if you want to expose your data fusion,

07:40.000 --> 07:43.000
using a standard protocol like Apache Arrow Flight SQL,

07:43.000 --> 07:46.000
we made a little contribution there that allows you

07:46.000 --> 07:49.000
to just spawn that up very easily,

07:49.000 --> 07:53.000
and with a couple of these systems,

07:53.000 --> 07:55.000
you know, a couple of these libraries,

07:55.000 --> 07:57.000
you can actually throw together a whole bunch of different systems,

07:57.000 --> 08:00.000
here is some examples from the community that people have built,

08:00.000 --> 08:03.000
from this kind of single pane of glass for data,

08:03.000 --> 08:05.000
for AI purposes,

08:05.000 --> 08:07.000
to people doing workflow automation,

08:07.000 --> 08:09.000
and even some web3 people,

08:09.000 --> 08:12.000
I think aggregating data over multiple types of blockchains,

08:13.000 --> 08:17.000
so all kind of using the same components in different settings,

08:17.000 --> 08:22.000
also, I think, showing that this is not exactly rocket science.

08:22.000 --> 08:26.000
Again, I would really implore people to just try this out.

08:26.000 --> 08:28.000
So with that in mind,

08:28.000 --> 08:30.000
here are the links to all of the stuff,

08:30.000 --> 08:32.000
obviously data fusion itself,

08:32.000 --> 08:34.000
the federation, and the Flight SQL server,

08:34.000 --> 08:38.000
and there's also a contribution from the nice people at Spice AI,

08:38.000 --> 08:41.000
who've created a whole bunch of table providers

08:41.000 --> 08:44.000
for basic files,

08:44.000 --> 08:46.000
but also for remote database systems,

08:46.000 --> 08:49.000
which have support for DataFusion federation,

08:49.000 --> 08:50.000
for efficiency's sake,

08:50.000 --> 08:54.000
and then you can also see there is a pull request

08:54.000 --> 08:55.000
for TiDB,

08:55.000 --> 08:57.000
very much in the

08:57.000 --> 09:00.000
spirit of the talk from earlier,

09:00.000 --> 09:03.000
let's try to get these databases to adopt Apache Arrow Flight SQL

09:03.000 --> 09:04.000
as well,

09:04.000 --> 09:07.000
so that we can get the arrow flowing all the way

09:07.000 --> 09:10.000
from the client side down to storage.

09:10.000 --> 09:13.000
All right, and then we just have a little demo

09:13.000 --> 09:15.000
combining all these things together,

09:15.000 --> 09:18.000
and I guess I'll show it.

09:18.000 --> 09:21.000
Yeah, hello everyone.

09:23.000 --> 09:25.000
Yeah, so we built this demo,

09:25.000 --> 09:29.000
based on everything we were talking about,

09:29.000 --> 09:35.000
and this is based on two Docker containers,

09:35.000 --> 09:37.000
one is MySQL,

09:37.000 --> 09:39.000
and the other is a Postgres,

09:39.000 --> 09:45.000
and now we're trying to run a Flight SQL server,

09:45.000 --> 09:49.000
and this is the mapping that my friend was talking about,

09:49.000 --> 09:51.000
which we basically,

09:51.000 --> 09:53.000
so each table is in a different database,

09:53.000 --> 09:55.000
one is in the MySQL,

09:55.000 --> 09:57.000
and the other is in the SQLite,

09:57.000 --> 09:59.000
and the products table is in the Postgres,

09:59.000 --> 10:03.000
and we have also a mapping for the columns as well,

10:03.000 --> 10:17.000
so now we run the client,

10:17.000 --> 10:32.000
so we run the client with the first query,

10:32.000 --> 10:35.000
which is just a filtering query,

10:35.000 --> 10:38.000
and this filtering is pushed down,

10:38.000 --> 10:40.000
using the DataFusion federation,

10:40.000 --> 10:43.000
and we have also another query

10:43.000 --> 10:48.000
which does a join with multiple remote database tables.

10:48.000 --> 10:50.000
If you see this query,

10:50.000 --> 10:53.000
it has, like, three tables,

10:53.000 --> 10:55.000
one of them is orders,

10:55.000 --> 10:57.000
another is users,

10:57.000 --> 11:00.000
and it's doing left joins with the orders and products,

11:00.000 --> 11:04.000
where each one is located in a different remote database,

11:04.000 --> 11:14.000
and that's the structure for the whole demo,

11:14.000 --> 11:20.000
and how we run this all component together

11:20.000 --> 11:25.000
to have this result.

11:25.000 --> 11:26.000
Okay, great.

11:26.000 --> 11:28.000
If you want to look at the code at some point,

11:28.000 --> 11:29.000
it's just link at the bottom,

11:29.000 --> 11:31.000
you can figure out the entire demo,

11:31.000 --> 11:33.000
basically everything I presented today is in there,

11:33.000 --> 11:36.000
and yeah, it's not exactly rocket science again,

11:36.000 --> 11:41.000
so I'd definitely implore people to try and embed some query engines into your services.

11:41.000 --> 11:43.000
All right, thanks everyone.

11:43.000 --> 11:59.000
Are there any questions?

11:59.000 --> 12:01.000
Yeah, in the demo,

12:01.000 --> 12:04.000
you mean that slide.

12:04.000 --> 12:07.000
So basically, in our stack,

12:07.000 --> 12:09.000
we have kind of a little bit of an object model

12:09.000 --> 12:12.000
that describes what the external facing world looks like,

12:12.000 --> 12:15.000
and this is what we use to steer the optimizers

12:15.000 --> 12:18.000
to do the mapping between the external user facing

12:18.000 --> 12:21.000
or high-level object model,

12:21.000 --> 12:25.000
and then all the internal storage behind the scenes.

12:25.000 --> 12:28.000
Yeah, there's a lightweight version of that

12:28.000 --> 12:30.000
in the demo we published,

12:30.000 --> 12:34.000
obviously internally we have a more extensive setup.

12:43.000 --> 12:58.000
So I think the question is about the Federation,

12:58.000 --> 13:02.000
how exactly the execution happens there?

13:02.000 --> 13:03.000
Yeah.

13:03.000 --> 13:06.000
So what happens is you look at the entire plan tree,

13:06.000 --> 13:10.000
and the data fusion federation optimizer figures out,

13:10.000 --> 13:13.000
which subplans, so little trees in the overall plan,

13:13.000 --> 13:16.000
are provided by the same remote source.

13:16.000 --> 13:19.000
And then it cuts out that part, all those little trees,

13:19.000 --> 13:22.000
materializes them back to whatever query format

13:22.000 --> 13:24.000
the remote supports, it could be SQL,

13:24.000 --> 13:26.000
or something like Flux,

13:26.000 --> 13:29.000
for InfluxDB, or something like that.

13:29.000 --> 13:31.000
And then when the plan is being executed,

13:31.000 --> 13:34.000
it goes ahead and does those queries on the remote,

13:34.000 --> 13:36.000
and then DataFusion itself

13:36.000 --> 13:39.000
does the final job of bringing all that together

13:39.000 --> 13:42.000
and creating the final output.

13:42.000 --> 13:47.000
Yeah.

13:47.000 --> 13:51.000
Yeah.

13:51.000 --> 13:52.000
Yeah, exactly.

13:52.000 --> 13:56.000
So the subplans are being executed

13:56.000 --> 13:58.000
by the remote database,

13:58.000 --> 14:02.000
and then tying it all back together happens in DataFusion.

14:02.000 --> 14:06.000
So you try to push the compute down as close to where the data

14:06.000 --> 14:09.000
lives as possible, and where the indexes live.

14:37.000 --> 14:38.000
Yeah, exactly.

14:38.000 --> 14:43.000
I think the question is a little bit about the default interface

14:43.000 --> 14:48.000
in data fusion, so if I go back up.

14:48.000 --> 14:51.000
This is kind of what DataFusion provides out of the box,

14:51.000 --> 14:53.000
what you can do here is a little bit limited.

14:53.000 --> 14:55.000
You can get data back, you can ask for which columns,

14:55.000 --> 14:57.000
you can do some limited filtering,

14:57.000 --> 15:00.000
and then limit how many records we want to get back.

15:00.000 --> 15:02.000
And that's obviously a little bit limited,

15:02.000 --> 15:04.000
and many remote database systems are

15:04.000 --> 15:05.000
way more expressive than this.

15:05.000 --> 15:08.000
For example, this would not be able to support joins.
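That limited scan interface, reduced to its essence, looks something like this; a hand-rolled sketch, not the actual `TableProvider::scan` signature:

```rust
// A toy version of the limited default scan interface: the engine can push a
// projection (which columns), simple filters, and a limit down to the source,
// but nothing as expressive as a join -- which is why federation exists.
struct Row {
    id: i64,
    amount: i64,
}

fn scan(rows: &[Row], min_amount: i64, limit: usize) -> Vec<i64> {
    rows.iter()
        .filter(|r| r.amount >= min_amount) // pushed-down filter
        .take(limit)                        // pushed-down limit
        .map(|r| r.id)                      // pushed-down projection
        .collect()
}
```

Anything beyond this shape, like a join between two remote tables, cannot be expressed through the scan call alone and falls back to the local engine.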

15:08.000 --> 15:10.000
But the alternative here,

15:10.000 --> 15:13.000
what we're doing with data fusion federation is,

15:13.000 --> 15:16.000
by cutting out the subplan, you can actually

15:16.000 --> 15:21.000
plug in your own serialization for that subplan.

15:21.000 --> 15:23.000
So if you're doing something complicated in there,

15:23.000 --> 15:25.000
you can actually materialize it back in whatever format

15:25.000 --> 15:28.000
that the remote database supports to then execute it

15:28.000 --> 15:31.000
with the full fidelity of the query language.

15:31.000 --> 15:33.000
Get that back into data fusion, and then it will resolve

15:33.000 --> 15:35.000
the rest of the plan.
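The "materialize the subplan back into SQL" step can be sketched as a tiny unparser; real unparsing in DataFusion federation handles full expression trees and per-backend dialects, so this is only the shape of the idea:

```rust
// A toy unparser: take the collapsed subplan and materialize it back into the
// remote's query language. Plugging in a serializer per backend is what lets
// the remote execute with the full fidelity of its own dialect.
enum SubPlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<SubPlan> },
}

fn to_sql(plan: &SubPlan) -> String {
    match plan {
        SubPlan::Scan { table } => format!("SELECT * FROM {}", table),
        SubPlan::Filter { predicate, input } => {
            format!("{} WHERE {}", to_sql(input), predicate)
        }
    }
}
```

The generated string is what actually travels to the remote database; the local engine only ever sees the Arrow batches that come back.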

15:41.000 --> 15:45.000
Wait, I'm not able to understand this, sorry.

15:45.000 --> 15:47.000
Yeah.

16:01.000 --> 16:07.000
So I think DataFusion has a MySQL-

16:07.000 --> 16:10.000
or Postgres-inspired dialect,

16:10.000 --> 16:13.000
and we are kind of extending on top of that.

16:13.000 --> 16:15.000
Then within the federation itself,

16:15.000 --> 16:18.000
you can get to choose in what dialect you materialize

16:18.000 --> 16:20.000
the plan to execute remotely.

16:20.000 --> 16:22.000
But outside of that,

16:22.000 --> 16:25.000
we do just use whatever data fusion uses,

16:25.000 --> 16:28.000
which is basically ANSI SQL.

16:28.000 --> 16:35.000
All right, thanks everyone.

