WEBVTT

00:00.000 --> 00:18.400
Okay, good. So welcome to the talk, thanks for coming. So let's talk about bike sharing

00:18.400 --> 00:23.800
for a moment. For those of you that are not familiar with bike sharing, it's these funny bicycles

00:23.800 --> 00:28.800
that are around in the cities. They have stations, and people are getting the bicycles

00:28.800 --> 00:32.480
and leaving bicycles on stations, right?

00:32.480 --> 00:37.720
Somehow, ten years ago I started building this project that was only supposed to be for

00:37.720 --> 00:44.760
bike sharing in Barcelona, and I ended up scraping bike sharing data for almost the whole world;

00:44.760 --> 00:49.600
there are around eight hundred cities around the world that are supported.

00:49.600 --> 00:55.000
But the scope of the talk is more of what kind of information the project has.

00:55.000 --> 00:59.000
So the first one is availability data. When you look at availability data, it's basically

00:59.000 --> 01:03.960
a point, which is a station; it has latitude and longitude, and then you have information like

01:03.960 --> 01:07.600
the number of bicycles that there are and the timestamp when this information was

01:07.600 --> 01:11.320
accessed. I call this availability data.

01:11.320 --> 01:16.920
Then, nicer organizations, like the Citi Bike system in New York, have something called

01:16.920 --> 01:24.240
trip data. Trip data is basically from point A to point B; that's a trip. It's somebody

01:24.240 --> 01:30.240
that has made one trip from one point to the other. Then you have something like, at what...

01:30.240 --> 01:33.480
When did it start, when did it end, and so on.

01:33.480 --> 01:39.680
So this is very nice information to work with, but of course not all organizations are

01:39.680 --> 01:46.560
giving this kind of information because it's, I don't know, expensive, interesting, whatever.

01:46.560 --> 01:51.840
So I came up with this idea; I call it time series of availability data, which is basically

01:51.840 --> 01:57.840
the same availability data that I was talking about at the start, but over time, so you can see

01:57.840 --> 02:04.160
for one particular station, all the different states that it has gone through, and it's

02:04.160 --> 02:11.120
only showing you whenever it has changed, so it's not a time series in the sense of fixed

02:11.200 --> 02:17.240
intervals, but whenever the information has changed.

02:17.240 --> 02:25.280
Okay, so, this is a live demo. First off, a disclaimer: I have no idea what I'm doing in

02:25.280 --> 02:34.320
terms of GIS. My background is systems engineering, so all of this has been mostly learned

02:34.320 --> 02:45.480
as I went, or as I go. So the data that you can get from CityBikes is on this website;

02:45.480 --> 02:49.600
these are different Parquet files that you can get, and there is one for every month.

02:49.600 --> 02:55.240
So we're going to be working with the month of December of last year. Luckily, I already

02:55.240 --> 03:01.320
have it downloaded. We're going to use our favorite tool, which is DuckDB. Is the

03:01.320 --> 03:11.000
font nice? Bigger? Yeah, good. So when we read one of these Parquet files that I have available

03:11.000 --> 03:17.720
on the internet, this is the information that is contained in it. There is the tag that

03:17.720 --> 03:24.520
basically is what identifies a particular system. Then there is the ID of the station;

03:24.520 --> 03:29.240
it's supposed to be unique across the data set, but sometimes it's not, especially

03:29.240 --> 03:37.080
with latitude and longitude that are 0, 0. Then there is this new ID, where there is a lot

03:37.080 --> 03:51.360
of info in there. If we look at it, maybe just get 10 rows of it... wait. So this is what we get

03:51.440 --> 04:02.320
for instance, for the system over here in Brussels. Oops, sorry, a lot of that. We get this

04:02.320 --> 04:10.080
information available, which is one station, and then we get, for instance, there were seven

04:10.080 --> 04:16.160
bikes and 18 spaces at this particular point in time. Then, at another point in time, suddenly

04:16.160 --> 04:25.760
it got 8 bikes, then 9 bikes, 8 bikes, and so on, and so on. Five minutes left. The good thing

04:25.760 --> 04:30.960
is if I don't finish in time, the end result is also available, so we can look at it, even

04:30.960 --> 04:42.160
though we didn't manage to create the entirety of it. So let's... what table am I at? Select everything

04:42.160 --> 04:51.760
from foo, limit 100. No, there is no foo. Okay, I have the data already loaded in a table,

04:53.360 --> 05:02.000
so this is what this table looks like. There are lots of rows, but still it will work

05:02.000 --> 05:08.480
because it's DuckDB. But yeah, it has a geometry, which means I should also load spatial

05:08.560 --> 05:18.960
in here. And we want... so what can we do with this information? If, for instance, we try to look,

05:20.400 --> 05:26.400
so what I am trying to do is get one metric that we can represent in a map, and generate a

05:26.400 --> 05:32.480
heat map of the usage of a particular bike sharing network. And to get that, we only... I mean, we have

05:32.560 --> 05:43.200
bicycles available, which are kind of okay, like here in New York, sorry,

05:45.600 --> 05:51.440
name, bikes, and free, and then some timestamp. I mean, this is good information, but we need to

05:51.440 --> 05:57.680
aggregate it somehow. So something that we could be doing, we could be counting

05:57.680 --> 06:02.400
the number of entries, but it's also more interesting if we just try to get

06:02.400 --> 06:10.080
the difference between each state. We can do that with a window function, so let's just jump into

06:10.080 --> 06:18.000
this window function, LAG, here. Yeah, it's only... never mind, I'm sorry, this is not going to be able to

06:18.000 --> 06:39.840
be done live. But here, yeah, this one, what we are doing... so what then? If we add everything together

06:39.840 --> 06:44.800
and we just put the absolute value, we can kind of say this is more or less activity, right?

06:44.800 --> 06:53.520
Or maybe not, you tell me after. Okay, so yeah, we have this. We can, for instance, we could

06:53.520 --> 06:59.440
order, I mean, limit by 10, then we get the top 10 stations in a particular system,

06:59.440 --> 07:06.480
that's already good information. But then, if we create a GeoJSON file with this query,

07:07.200 --> 07:12.400
and I'm just going to skip ahead and generate it for the whole world and not just for the Villo system,

07:13.040 --> 07:23.280
then I go to this website that I created just for this, then we open it here. Oh, enter.

07:25.840 --> 07:32.720
Go on, work for me. Yes, and we have a heat map of the whole of Brussels, the whole world.

07:33.840 --> 07:39.920
There is no limit, so I don't know if somebody wants to see a particular city. Here it is,

07:39.920 --> 07:49.120
we have Paris. And the time's up, and this is the world. Thank you, thank you.

