WEBVTT

00:00.000 --> 00:12.880
Hello everyone, I'm Ilya, and I'll be speaking about visualizing mobility data. The previous

00:12.880 --> 00:19.240
talk was about preparing and collecting data, but we also need to be able to analyze

00:19.240 --> 00:24.040
and understand it, and for that we need good visualization tools.

00:24.040 --> 00:32.040
So, I'm a software engineer; I've been working mostly on geospatial data visualization,

00:32.040 --> 00:38.200
and yeah, I've participated in many projects, working with mobility data, representing

00:38.200 --> 00:42.560
it in various ways, and my PhD was focused on that topic as well.

00:42.560 --> 00:47.960
I also worked for a company called Teralytics, where we visualized mobility in cities,

00:47.960 --> 00:53.960
how people move in cities for different transportation providers to help them understand

00:53.960 --> 00:56.960
the demand and improve their services.

00:56.960 --> 01:06.640
So we built this dashboard, which allowed people to look at mobility data in cities, and the

01:06.640 --> 01:13.960
company I worked for was generous enough to allow us to release this flow mapping

01:13.960 --> 01:21.320
layer as an open source library, which we did. So there is the flowmap.gl library, which

01:21.320 --> 01:27.320
is a custom deck.gl layer implementation — if you're familiar with deck.gl,

01:27.320 --> 01:34.120
which is a framework for efficient geo data visualization.

01:34.120 --> 01:40.480
So you can use it in your own apps. But then I realized that not everybody is a programmer — well, that was a couple of years

01:40.480 --> 01:47.480
ago; now everybody is, of course. So I developed a tool which people can use without having

01:47.480 --> 01:55.480
to know how to program in JavaScript: you just throw some data —

01:55.480 --> 02:03.480
the data must be in a particular format — into a Google spreadsheet, then you

02:03.480 --> 02:09.720
just pass the URL to the tool, and it will magically visualize it as an interactive map. And

02:09.720 --> 02:14.720
people started using it and publishing stuff, which was pretty cool.

02:14.720 --> 02:20.720
Then some people started to contribute back — there is, for instance, an R integration

02:20.720 --> 02:24.720
developed by Egor Kotov, somebody from the community.

02:24.720 --> 02:27.720
So yeah, briefly, what is a flow map?

02:27.720 --> 02:31.720
Who's familiar with the concept of a flow map?

02:31.720 --> 02:38.720
Not that many people — okay. So it's about visualizing numbers of movements of people — or whatever,

02:39.720 --> 02:42.720
it can be goods, whatever entities — between pairs of geographic locations.

02:42.720 --> 02:45.720
You're not very interested in the exact routes

02:45.720 --> 02:50.720
people take; it's more about how many people or entities move from A to B,

02:50.720 --> 02:55.720
and then you can have additional attributes like time or mode of transport, whatever.

02:55.720 --> 03:02.720
So a flow map is a way to represent that, and often the thickness of the arrows is used to represent

03:02.720 --> 03:08.720
the amount of people moving, or the color; or sometimes you see these animated particle animations,

03:08.720 --> 03:14.720
where the number of particles represents the number of people moving.

03:14.720 --> 03:16.720
And the direction is important as well.

03:16.720 --> 03:21.720
So an arrow is a naturally understood representation of that.

03:21.720 --> 03:25.720
And the kind of data which can be represented this way is

03:25.720 --> 03:30.720
often called origin-destination data, OD data. So you basically have a table:

03:30.720 --> 03:33.720
You have origin, destination, and count.

03:33.720 --> 03:39.720
You can have additional columns for attributes like time or mode of transport.
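Such an OD table can be sketched in a few lines — a toy example with made-up values, and `mode` as one possible extra attribute column:

```python
import csv
import io

# A toy origin-destination (OD) table: one row per location pair,
# plus an optional attribute column (mode). The names and values
# are illustrative, not a fixed schema.
od_csv = """origin,dest,count,mode
Brussels,Antwerp,1200,train
Antwerp,Brussels,950,train
Brussels,Ghent,400,bus
"""

rows = list(csv.DictReader(io.StringIO(od_csv)))
total_trips = sum(int(r["count"]) for r in rows)
print(total_trips)  # 2550
```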

03:39.720 --> 03:48.720
So in developing this tool, I put a lot of effort into making the visual representation very readable, very understandable.

03:48.720 --> 03:58.720
So in flowmap.gl, there is a double encoding for the number of people moving.

03:58.720 --> 04:00.720
It's the thickness and the color.

04:00.720 --> 04:04.720
There is also sorting of the arrows, so that the most important ones are at the top.

04:04.720 --> 04:09.720
The arrows have outlines, so that when you draw a lot of them and they overlap,

04:09.720 --> 04:13.720
they're still recognizable as individual arrows.

04:13.720 --> 04:19.720
Then there is fading: you can basically configure the amount of

04:19.720 --> 04:24.720
downplaying of the less important flows — you can adjust how much darker they get.

04:24.720 --> 04:30.720
And this plays well with the blending which is used in flowmap.gl.

04:30.720 --> 04:36.720
So basically it's a way to make sure that the underlying base map is still readable,

04:36.720 --> 04:39.720
even when you have like thousands of arrows overlapping.

04:39.720 --> 04:45.720
And it wouldn't work with just opacity, because with opacity, when you have overlapping arrows,

04:45.720 --> 04:48.720
the colors would still add up.

04:48.720 --> 04:51.720
So you wouldn't be able to read the underlying map.

04:51.720 --> 05:00.720
But with blending — you can just use CSS blending, because the arrows are rendered in a separate WebGL context.

05:00.720 --> 05:02.720
This is a technicality.

05:02.720 --> 05:07.720
But never mind, you can make it so that despite thousands of overlapping arrows,

05:07.720 --> 05:11.720
you can still read the underlying base map.

05:11.720 --> 05:16.720
Then the location totals are presented as circle sizes.

05:16.720 --> 05:20.720
It's a bit more complex than that, but I won't go into the detail.

05:20.720 --> 05:24.720
There is an alternative way of representing the directionality of the arrows.

05:24.720 --> 05:27.720
You can use this fancy animation.

05:27.720 --> 05:34.720
Sometimes it's actually more readable, but often it's just like more appealing.

05:34.720 --> 05:43.720
And still, even with all these techniques, you can have too many flows, which produce a noisy picture

05:43.720 --> 05:49.720
and reduce the performance, because you have to render all of them.

05:49.720 --> 05:55.720
But many of them actually aren't needed: here you see there are flows which start and end outside of the viewport.

05:55.720 --> 05:57.720
Why, what's the point of rendering them?

05:57.720 --> 05:59.720
They only obscure the picture.

05:59.720 --> 06:05.720
So what flowmap.gl is doing is actually adaptive filtering.

06:05.720 --> 06:14.720
We filter the flows so that only those which start or end within the viewport are rendered.

06:14.720 --> 06:18.720
Also, we don't show more than a certain number of flows.

06:18.720 --> 06:21.720
You can adjust that.

06:21.720 --> 06:26.720
So when you like zoom in, you see the detail for this particular region.

06:26.720 --> 06:29.720
We also adjust the scales, right?

06:29.720 --> 06:37.720
So when you zoom in, the flows which are the largest for this particular viewport will pop up.

06:37.720 --> 06:42.720
So this way you can explore the regions in more detail when you zoom in.

06:42.720 --> 06:46.720
And it improves performance.
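The adaptive filtering just described can be sketched in a few lines of Python — a conceptual sketch with hypothetical names, not the actual flowmap.gl API: keep only flows whose origin or destination falls inside the viewport, then cap the result at the top-N flows by magnitude, so zooming in reveals more local detail.

```python
def flows_for_viewport(flows, locations, bbox, max_flows=1000):
    """Return the at-most `max_flows` largest flows that touch the
    viewport `bbox` = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox

    def inside(loc_id):
        x, y = locations[loc_id]
        return xmin <= x <= xmax and ymin <= y <= ymax

    # Flows that neither start nor end in the viewport only obscure
    # the picture, so drop them entirely.
    visible = [f for f in flows if inside(f["origin"]) or inside(f["dest"])]
    # Largest flows first, so the cap keeps the most important ones.
    visible.sort(key=lambda f: f["count"], reverse=True)
    return visible[:max_flows]

locations = {"A": (0, 0), "B": (5, 5), "C": (50, 50)}
flows = [
    {"origin": "A", "dest": "B", "count": 10},
    {"origin": "B", "dest": "C", "count": 3},
    {"origin": "C", "dest": "C", "count": 99},  # entirely off-screen
]
# Viewport covering only A and B: the C->C flow is dropped.
print(flows_for_viewport(flows, locations, (0, 0, 10, 10), max_flows=2))
```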

06:46.720 --> 06:54.720
Still, you can often get some messy datasets, like this one, which is bus trips in São Paulo.

06:54.720 --> 07:02.720
Here, the lengths of the arrows are relatively long in relation to the distribution of the locations.

07:02.720 --> 07:04.720
So you get lots of over-plotting, right?

07:04.720 --> 07:06.720
And it's not very readable.

07:06.720 --> 07:08.720
Or here, the opposite extreme.

07:08.720 --> 07:12.720
You have migration in Australia where most people migrate close by, right?

07:12.720 --> 07:14.720
Not across the whole country.

07:14.720 --> 07:18.720
So here you basically don't see the flows at all, because they're too short.

07:18.720 --> 07:21.720
So how can we address that?

07:21.720 --> 07:26.720
Maybe we can create a useful summary which would work at any zoom level,

07:26.720 --> 07:32.720
independently of what the exact distribution of the locations and flows is.

07:32.720 --> 07:38.720
So for that, flowmap.gl is doing hierarchical clustering.

07:38.720 --> 07:43.720
So who's familiar with the hierarchical clustering?

07:43.720 --> 07:45.720
Some people. Okay, so that's fine.

07:45.720 --> 07:47.720
Basically, we have the locations.

07:47.720 --> 07:49.720
We calculate the total flows for the locations.

07:49.720 --> 07:53.720
The total outgoing or incoming — it can be different metrics,

07:53.720 --> 07:56.720
but basically how important the locations are.

07:56.720 --> 08:03.720
And we start with the largest, so that when we group them together, the largest ones

08:03.720 --> 08:09.720
basically define where the cluster will be located, right?

08:09.720 --> 08:15.720
So we don't move Brussels to some smaller location.

08:15.720 --> 08:21.720
Then, for every zoom level,

08:21.720 --> 08:27.720
we have a certain radius within which we group together locations which are close to the locations we start with, right?

08:27.720 --> 08:30.720
So we group them together, we get the clusters.

08:30.720 --> 08:35.720
We move them to the centers of mass, weighted by the total flows.

08:35.720 --> 08:38.720
And we get the cluster locations at this level.

08:38.720 --> 08:45.720
Now we have to recalculate the flows: we basically aggregate all the flows between the constituents of the clusters.

08:45.720 --> 08:47.720
And we get the new flows at this level.

08:47.720 --> 08:52.720
So this way we get aggregated summary for this particular zoom level.

08:52.720 --> 09:02.720
And why it's called hierarchical is that we can repeat this process and go up, building a hierarchy of clusters for the range of zoom levels we are interested in.
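One level of this clustering can be sketched as follows — a simplified sketch of the idea, not the actual flowmap.gl code; the function and variable names are made up:

```python
from collections import defaultdict
from math import hypot

def cluster_level(locations, flows, radius):
    """One clustering level: locations is {id: (x, y)},
    flows is {(origin, dest): count}."""
    # 1. Rank locations by total throughput, largest first, so the
    #    biggest locations become cluster seeds (we don't move
    #    Brussels to some smaller place).
    totals = defaultdict(float)
    for (o, d), n in flows.items():
        totals[o] += n
        totals[d] += n
    ranked = sorted(locations, key=lambda i: -totals[i])

    # 2. Attach each location to the nearest seed within `radius`,
    #    or promote it to a new seed.
    seeds, member = [], {}
    for loc in ranked:
        x, y = locations[loc]
        best, best_dist = None, None
        for s in seeds:
            sx, sy = locations[s]
            dist = hypot(x - sx, y - sy)
            if dist <= radius and (best is None or dist < best_dist):
                best, best_dist = s, dist
        if best is None:
            seeds.append(loc)
            best = loc
        member[loc] = best

    # 3. Move each cluster to its center of mass, weighted by totals.
    centers = {}
    for s in seeds:
        members = [l for l in locations if member[l] == s]
        weights = [totals[l] or 1.0 for l in members]
        w = sum(weights)
        centers[s] = (
            sum(locations[l][0] * wl for l, wl in zip(members, weights)) / w,
            sum(locations[l][1] * wl for l, wl in zip(members, weights)) / w,
        )

    # 4. Re-aggregate flows between the clusters at this level.
    agg = defaultdict(float)
    for (o, d), n in flows.items():
        agg[(member[o], member[d])] += n
    return centers, member, dict(agg)

# Tiny example: B is close to the much bigger A, C is far away.
locations = {"A": (0, 0), "B": (1, 0), "C": (10, 0)}
flows = {("A", "C"): 100, ("B", "C"): 10}
centers, member, agg = cluster_level(locations, flows, radius=2.0)
print(member["B"], agg)  # B joins A's cluster; its flow merges into A->C
```

Repeating `cluster_level` on its own output with a growing radius yields the hierarchy of zoom levels described in the talk.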

09:03.720 --> 09:17.720
So with this approach, from these messy datasets we get something like this, which is much more readable. But when we look at only one zoom level, we lose the details, right?

09:17.720 --> 09:31.720
The good thing is that this is fast enough that we can do it interactively: we can zoom in and out, and the clusters will expand and collapse depending on the

09:31.720 --> 09:43.720
zoom level. So this is public transport in Brisbane, and you can zoom in and see more details about a particular region — how people move there.

09:43.720 --> 09:49.720
This is road traffic. So with this approach — this is also road traffic —

09:49.720 --> 09:58.720
it's not origin-destination data; it's how many people move within each segment, which can also be kind of converted into OD.

09:59.720 --> 10:11.720
This is public transport in Zurich, and here there is also a temporal dimension: you can filter by tram line or over time.

10:11.720 --> 10:27.720
So, by the way, this app is not open source, but much of the technology behind it is, and I'm going to open source parts of it as part of a new project I will mention. This one is using DuckDB — who here is familiar with DuckDB?

10:27.720 --> 10:36.720
A few people, okay. So, this is a pretty cool database which has some advantages for these kinds of use cases.

10:36.720 --> 10:46.720
First, it's made for analytics: it's using a columnar data representation, which means queries — aggregations, for instance — are pretty efficient.

10:47.720 --> 11:07.720
It's also embeddable really anywhere. So in this example I showed you before, DuckDB was running in the browser directly — there's no back end serving these queries — and every time I zoomed in and out or moved the viewport, there were dozens of SQL queries preparing the data for this particular viewport.

11:07.720 --> 11:18.720
And DuckDB, since it can run in the browser via WebAssembly, can do that, and it requires very little infrastructure setup.

11:18.720 --> 11:32.720
So basically you just need to download the data from S3, like a Parquet file, add it to the DuckDB running in the browser, and you are good to go — you can run queries directly in the browser.

11:32.720 --> 11:41.720
So yeah, in this app, when you load the dataset, it will prepare — the text is too small, I know, but don't worry.

11:41.720 --> 11:53.720
It prepares pre-aggregated tables, which make it easier to later run queries for a particular viewport to get the data it needs.

11:53.720 --> 12:07.720
For instance, for the different zoom levels there is a mapping between original locations and clusters at each zoom level, so it can quickly do the mapping and prepare the data.
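The pre-aggregation idea can be sketched in plain SQL — here using Python's built-in sqlite3 as a stand-in for DuckDB, with made-up table and column names: a mapping table assigns each original location to a cluster per zoom level, so the per-viewport query becomes a cheap join-and-group-by.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flows (origin TEXT, dest TEXT, count INTEGER);
CREATE TABLE cluster_map (zoom INTEGER, location TEXT, cluster TEXT);

INSERT INTO flows VALUES ('a1','b1',10), ('a2','b1',5), ('b1','a1',2);
-- At zoom 0, a1 and a2 collapse into cluster 'A', b1 into 'B'.
INSERT INTO cluster_map VALUES (0,'a1','A'), (0,'a2','A'), (0,'b1','B');
""")

# Aggregate flows between clusters for a given zoom level.
rows = con.execute("""
    SELECT mo.cluster AS o, md.cluster AS d, SUM(f.count) AS total
    FROM flows f
    JOIN cluster_map mo ON mo.location = f.origin AND mo.zoom = 0
    JOIN cluster_map md ON md.location = f.dest   AND md.zoom = 0
    GROUP BY mo.cluster, md.cluster
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('A', 'B', 15), ('B', 'A', 2)]
```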

12:07.720 --> 12:18.720
So this is the demo — if you want to try it out, you can quickly scan the code. I won't be showing it, for the sake of time.

12:18.720 --> 12:36.720
So yeah, now I'm working on this new open source project. It's called SQL Rooms, and it's basically a framework which helps you build data analysis apps, and it's backed by DuckDB.

12:36.720 --> 12:47.720
And it's very modular: there are different kinds of modules, and you decide which functionality you want to add to your application.

12:47.720 --> 13:07.720
So yeah — like a basic dataset browser, a database browser where you can see the tables you have; you can have an AI assistant which will generate SQL queries for you, which can still run in the browser without sending your data to any of the providers.

13:07.720 --> 13:20.720
Many things. And it's been used by several companies, specifically for data-intensive data visualization applications.

13:20.720 --> 13:33.720
And I'm now working on an example showing a flow map which serves the data directly from DuckDB in the browser.

13:33.720 --> 13:47.720
And as part of this work, I wanted to have a demo for this conference, but I haven't managed — life got intense.

13:47.720 --> 13:52.720
Sorry about that, but it's a work in progress, and I'm explaining what I'm working on.

13:52.720 --> 14:01.720
So, for even larger datasets: DuckDB is already pretty powerful — you can have millions of rows in your flows table and it will be good enough.

14:01.720 --> 14:11.720
But if you have many attributes or small temporal buckets, the number of rows in your flows table will multiply very quickly.

14:11.720 --> 14:18.720
And at some point it will be too large to load the entire thing into the browser.

14:18.720 --> 14:30.720
You want to avoid that. But what you can do is prepare your dataset in a way that lets you fetch only the parts which you actually need for the current viewport.

14:30.720 --> 14:41.720
And DuckDB supports HTTP range requests. So basically, if you have a Parquet file somewhere in S3,

14:41.720 --> 14:51.720
and it has a column by which you can filter, you can say: I want this range of values of this column.

14:51.720 --> 14:59.720
And if the rows are sorted in the right way, then it will only need to read those parts of the table which satisfy

14:59.720 --> 15:12.720
your query condition, right? So if you do it this way, it will be fetching data on demand as you change your viewport, without having to load the entire table.

15:12.720 --> 15:25.720
And so, to do this — who is familiar with space-filling curves? Just a few people, okay. So this is a mathematical concept, but it's actually easy to explain.

15:26.720 --> 15:35.720
The purpose is to compress two dimensions into a single dimension. So you have two columns, x and y, right?

15:35.720 --> 15:44.720
Or lat and lon — though it's better to use x and y, projected already. And these are two numbers, which you could store separately.

15:44.720 --> 15:54.720
But our purpose is to have a column by which we can sort, so that it will be easy to query — so that things which are close to each other on the map

15:54.720 --> 16:03.720
are also close to each other in this table, so that we need to fetch fewer chunks, right?

16:03.720 --> 16:13.720
And this is the way to do that. Basically, you draw a curve which fills the entire plane, right?

16:13.720 --> 16:20.720
And then at every junction you put a number: one, two, three, four, and so on. And this number is the index, right?

16:20.720 --> 16:32.720
Which you can then put in your column. And this way you can reduce the two dimensions, x and y, into a single index, right?

16:32.720 --> 16:42.720
And this will be the number you want to sort by, so that you then only need to read a few chunks from the table
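The numbering just described can be sketched in pure Python — the classic xy-to-Hilbert-distance conversion for a 2^k-by-2^k grid (in DuckDB you would use the spatial extension's built-in function instead of implementing it yourself):

```python
def hilbert_d(n, x, y):
    """Map a cell (x, y) of an n-by-n grid (n a power of two) to its
    distance d along the Hilbert curve. Nearby cells tend to get
    nearby d values, which is what makes sorting by d useful."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the pattern recurses correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# The four cells of a 2x2 grid, listed in curve order:
print([hilbert_d(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
# [0, 1, 2, 3]
```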

16:42.720 --> 16:54.720
if you sort by this number. And DuckDB has a function for this, ST_Hilbert, in the DuckDB spatial extension — you need to load the spatial extension — to calculate this index.

16:54.720 --> 17:07.720
And as part of the project I'm working on, there will be a Python script preparing this: you give it a normal OD dataset, just a simple, plain origin-destination-count table,

17:07.720 --> 17:14.720
and it will prepare this kind of Parquet file, sorted and with the index.

17:14.720 --> 17:20.720
So this is the kind of SQL query you need to prepare this.

17:20.720 --> 17:31.720
But the trick is that you need to do this calculation of the Hilbert index twice, because we have lat/lon, right?

17:31.720 --> 17:38.720
But we have it for both the origin and the destination — for OD data it's not just one point, right? It's a start and an end.

17:38.720 --> 17:48.720
So you do it twice: you compute it first for each of the origin and destination lat/lons.

17:48.720 --> 17:54.720
Then you get two numbers, right? But you can apply ST_Hilbert again to these two numbers you got for the origin and destination.

17:54.720 --> 17:58.720
And it compresses them again, so you get from four dimensions to a single dimension.

17:58.720 --> 18:04.720
You sort your table by this single column — flow_h in this case.

18:04.720 --> 18:13.720
And then the flows which have the same or a close number in this column will also be close to each other on the map.
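This twice-applied encoding can be sketched like so — a conceptual Python stand-in for doing it in SQL with ST_Hilbert; the grid size is made up, and flow_h follows the column name mentioned in the talk:

```python
def hilbert_d(n, x, y):
    # Standard xy -> Hilbert distance for an n-by-n grid (n = 2^k).
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def flow_h(n, ox, oy, dx, dy):
    """Compress an OD pair (4 numbers) into one sort key: encode the
    origin cell and the destination cell, then encode that pair again."""
    d_origin = hilbert_d(n, ox, oy)            # in [0, n*n)
    d_dest = hilbert_d(n, dx, dy)              # in [0, n*n)
    return hilbert_d(n * n, d_origin, d_dest)  # in [0, n**4)

# Every distinct OD pair on a 2x2 grid gets a distinct sort key:
keys = {flow_h(2, ox, oy, dx, dy)
        for ox in range(2) for oy in range(2)
        for dx in range(2) for dy in range(2)}
print(sorted(keys) == list(range(16)))  # True
```

Sorting the flows table by this key keeps rows with nearby origins *and* nearby destinations in the same chunks of the Parquet file.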

18:13.720 --> 18:18.720
And then we don't need to read as many chunks to render this data.

18:18.720 --> 18:23.720
So yeah, that's it.

18:23.720 --> 18:29.720
To follow this work, the best place is probably SQL Rooms, where new stuff will be appearing.

18:29.720 --> 18:35.720
And otherwise: flowmap.gl, flowmap.blue, the flowmapblue R package, flowmap.city.

18:35.720 --> 18:39.720
And here's the example, you can try. Thank you.

18:39.720 --> 18:46.720
Thank you.

18:46.720 --> 18:53.720
Is there time for questions? No — okay, sorry about that.

18:53.720 --> 18:56.720
But reach out to me.

