WEBVTT

00:00.000 --> 00:11.520
Thanks everyone for coming. My name is David Cornierzik and I'd like to talk to you about

00:11.520 --> 00:17.680
the state of, specifically, Czech public transport data. I've always been a fan of trains

00:17.680 --> 00:25.760
and for the last few, and by few I mean, what, eight years, I've been working on finding

00:25.760 --> 00:32.560
and improving and then using Czech public transport data, ideally under some sort of

00:32.560 --> 00:40.480
open data license. I'd like to talk first I'll give you a rough outline on what kind of public

00:40.480 --> 00:45.440
transport we even have to talk about here. A rough history of the central timetable database

00:45.440 --> 00:52.720
because it's fascinating and timetables I feel are sort of the core core data that you need

00:52.720 --> 01:01.840
to do most anything with public transit. Then I'll summarize what data is available right now how

01:01.840 --> 01:08.320
usable it is. Then I'll complain about international formats, so I'll complain about GTFS

01:08.320 --> 01:15.600
because it doesn't quite work for us or everyone and then I'll give you some links to

01:15.600 --> 01:21.280
FOSS projects which you can use or even contribute to. I would be happy for your contribution

01:21.280 --> 01:26.320
certainly. So in the Czech Republic we have pretty much anything. We even have ferries, we

01:26.320 --> 01:32.160
have trolleybuses, we have underground trains, and most of these are run by private or public

01:32.160 --> 01:40.880
agencies generally just private agencies or the city itself has a company made for this purpose

01:41.760 --> 01:51.440
but most, and I'd say like 99% by kilometers driven, of these routes are actually ordered

01:51.440 --> 02:00.080
by the government in the broader sense so the central government region or a city. So there's

02:00.080 --> 02:06.720
only a small minority of routes which are fully privately operated. Also complicating matters

02:06.800 --> 02:16.640
is that on any one route you can generally buy tickets from at least three different tariffs, so

02:16.640 --> 02:21.600
the agency's own, a city-specific one, a region-specific one, or OneTicket, which is supposed to be

02:21.600 --> 02:26.880
a unifying tariff for all the trains and maybe other things but that's just never happened so

02:29.520 --> 02:35.920
we need to encompass all this in our well we want to encompass all this in our public transport

02:35.920 --> 02:45.760
data. What exists right now, and by exists I mean on a computer in a database? Timetables of course

02:45.760 --> 02:53.840
exist somewhere. There is an important division between regular ones and diversions, because some

02:53.840 --> 02:59.520
sources will only give you regular timetables, and when something changes you are just out of luck.

03:00.400 --> 03:07.600
Real-time positions and delays of vehicles that's certainly interesting to look at but

03:08.320 --> 03:16.080
also interesting for statistics, to see how well the trains are running. This is even important,

03:16.080 --> 03:21.760
I feel, from a political perspective, because you need data to assess whether the person in charge

03:21.760 --> 03:28.400
is actually doing things properly. Train composition that data exists and it's sometimes accessible

03:28.400 --> 03:36.000
to normal people but we'll see later reasons for disruptions also interesting mostly from a historical

03:36.000 --> 03:42.160
perspective to see what's the biggest problem that we should be focusing on but also just

03:42.160 --> 03:47.680
out of curiosity you want to know why your train is five hours delayed right and trajectories

03:47.680 --> 03:55.440
of transit lines. This all exists how much of it is centralized and how much of it is available is

03:55.440 --> 04:04.320
another thing and like I said most routes are ordered by some kind of public institutions so

04:04.320 --> 04:10.240
you can use that as a basis for requesting all this data, or most of it, under freedom of

04:10.240 --> 04:19.040
information laws. It's not as simple, as we'll see in a moment. CIS JŘ is the central timetable

04:19.040 --> 04:27.600
database of the Czech Republic. Every timetable of every public transport line should be held in

04:27.600 --> 04:34.560
CIS JŘ under some circumstances, like being there a few days earlier in case of most time

04:34.560 --> 04:41.840
table changes it was really actually forward looking I'm not going to say unique in Europe

04:41.840 --> 04:50.480
but very much futuristic in 1998 when this rolled out countrywide. The first contractor

04:52.080 --> 04:57.680
actually decided to develop it for free. Maybe they forgot we weren't a communist country anymore.

04:59.040 --> 05:07.520
then they got restructured like twice and by the board records that I have found board meeting

05:07.520 --> 05:13.440
records they decided well that's a very bad idea and it's costing us tens of millions of crowns

05:14.720 --> 05:22.160
every year to handle the system so they wanted to cancel the contract of course the ministry

05:22.160 --> 05:29.840
didn't want to pay now because they had like two years of service for free so another company

05:29.840 --> 05:37.120
stepped up and decided fine we'll do it for free but we'll make a new contract the ministry didn't

05:37.200 --> 05:46.800
care at that point, so this company, CHAPS, got sole permission to use the data for anything that's

05:46.800 --> 05:58.560
not mandated by law so selling the data became one of their primary profits and running services

05:58.560 --> 06:08.000
based on that data. This lasted about 10 years, almost exactly, and then Seznam, one of the biggest

06:08.960 --> 06:15.200
sort of internet services providers but not connectivity providers just services on the internet

06:15.200 --> 06:24.160
provider sort of weird thing that kind of rivals Google in breadth decided they wanted to start

06:24.160 --> 06:37.760
their own trip planner for public transit in the Czech Republic, so they requested the data

06:37.760 --> 06:44.480
based on a freedom of information law but of course the company didn't want to give them

06:44.480 --> 06:48.720
their data because they built their business model on providing the service for free and then

06:48.720 --> 06:56.240
keeping the data. So this took multiple years in court, very much tested the law, the freedom of

06:56.240 --> 07:02.160
information law, in ways that it wasn't tested before, generated lots of novel judicial opinions that are

07:02.160 --> 07:14.000
now cited in textbooks, so hooray. But they did win, though in the meantime the ministry,

07:14.720 --> 07:19.840
probably seeing the writing on the wall at the end changed their regulation

07:22.240 --> 07:28.000
sort of delegated regulation that's almost with force of law to make releasing that

07:28.800 --> 07:37.840
timetable data in full mandatory, and then after a brief spat with the company CHAPS, they decided

07:37.840 --> 07:47.760
to pay for upkeep of CIS JŘ, and CHAPS decided, fine, we'll release the data. Except they are

07:47.760 --> 07:56.000
very crafty and they released well only part of the data they released data that looked very much

07:57.040 --> 08:02.800
like what people wanted but was missing among other things two very crucial columns which meant

08:02.800 --> 08:12.080
that stops were no longer uniquely identified so you you've had a big zip file with like 10,000

08:13.920 --> 08:21.600
smaller zip files in there each of them covering one line in one period of time and now try building

08:21.600 --> 08:26.080
a central database back from this sort of data when you can't join on stops

08:26.320 --> 08:35.040
Seznam didn't pursue any further legal action, presumably, and now this is my sort of conspiracy

08:35.040 --> 08:48.640
theory because they because they also manage a map website so the web map website that has their own

08:49.200 --> 08:55.520
custom map data for the Czech Republic so what they decided they do is they gathered up

08:56.320 --> 09:03.280
position information on every public transit stop in the Czech Republic and then managed to

09:04.720 --> 09:10.960
that's the conspiracy theory sort of reverse engineer the data back into a usable form

09:11.040 --> 09:22.640
I decided I'm not going to stand for this and I sent my own information requests to

09:22.640 --> 09:28.240
CHAPS and to the ministry to get the full data back, because that's what the specification says

09:28.240 --> 09:34.480
I should get and that's what the law says I should get. Well, they didn't see it as quite clear-cut.

09:34.880 --> 09:47.440
after many years and two parallel lawsuits in basically the same thing just shifted by a week

09:47.440 --> 09:54.640
or something like that and probably thousands of euros spent on lawyer fees

09:55.280 --> 10:09.760
probably, maybe more, I won. I've got slightly better data. Thank you, thank you. It's a fascinating story,

10:09.760 --> 10:17.120
but I gave a talk on this in Czech and spent 45 minutes on it and I had to go really fast so

10:17.840 --> 10:28.320
so I got uniquely identified stops, but CHAPS, being who they are, they only did this for

10:28.320 --> 10:36.480
new newly inserted timetables I am very sure that they have the data for the old ones as well but

10:36.480 --> 10:41.760
it's pretty hard to prove and I've sort of run out of the structured ways of forcing them to do

10:41.760 --> 10:49.200
anything, so I decided screw it, because in the meantime I managed to do the same thing that Seznam

10:49.200 --> 10:58.640
did. So I gathered lists of stops and their positions from public data and managed to

10:58.640 --> 11:03.600
sort of reconstruct the original data from this and it works with a very high degree of precision

11:03.600 --> 11:10.000
so that now the real issue is not the missing columns which aren't missing anymore in parts of

11:10.000 --> 11:16.320
the data. The real issue now is that the data is just not 100% correct, because

11:18.080 --> 11:25.440
CHAPS don't use this data specifically for their trip search engine and their commercial products. They

11:25.440 --> 11:34.240
also have agreements with specific agencies which use their software and so they don't have any

11:34.240 --> 11:40.160
incentive to actually fix data issues within the database, and the ministry, which

11:40.160 --> 11:46.240
probably should take care of this doesn't want to so short of suing the agencies for sending

11:46.240 --> 11:56.720
wrong data I have to do something else we'll get to that later in the meantime of me fighting

11:56.720 --> 12:02.160
everyone around me the European Commission decided to mandate the release of a bunch of

12:02.160 --> 12:07.920
transport data and timetables in NeTEx. You've probably heard of it in other talks in this room.

12:08.720 --> 12:15.920
The deadline's long gone. Some countries have done the thing they should have done and produced

12:15.920 --> 12:26.480
actually good data, some countries just don't care at all, like Slovakia, and the Czech

12:26.480 --> 12:33.280
Republic decided to do the Czech thing, and the ministry gave CHAPS a contract to convert the

12:33.280 --> 12:39.520
existing data to NeTEx, and the existing data is still just as broken as the previous data.

12:40.880 --> 12:46.480
Now it just, it passes XSD validation, but I hope you know that's not the end.

12:47.520 --> 12:53.680
That data shouldn't lie, or it should, like, comply when it comes to semantics, and this,

12:54.320 --> 13:02.240
I'm sorry it doesn't it's even worse than the source data it's it's basic you can't you can't

13:02.240 --> 13:09.840
make non-broken data from broken data, barring any further inputs like Seznam's data, right? So that's

13:10.960 --> 13:17.360
sadly not really an option for us so where do we get timetables from well the other public

13:17.360 --> 13:25.680
institutions that play a role in this system, and they order most of the routes in the Czech

13:25.680 --> 13:34.480
Republic. So some of them are enlightened enough to actually release these things as open data.

13:36.080 --> 13:45.280
most of them don't do this sadly but it's a start there's other open data of course

13:46.240 --> 13:56.240
like stop positions in GTFS of course some regions I'd say most of them like 90% of them

13:56.240 --> 14:05.200
publish pretty good geo data databases of stop positions as well even if they don't publish the GTFS data

14:06.880 --> 14:14.320
real-time positions are available in some regions for regional routes and for all passenger trains

14:15.680 --> 14:20.880
sadly that's, except for I think two exceptions, that's never open data.

14:23.520 --> 14:29.520
it's it's a bit of an issue this is what it looks like in Prague which is part of the Prague

14:29.520 --> 14:37.200
integrated system, which just provides very good open data, so thanks, Prague. You can see even the

14:37.200 --> 14:44.640
type of vehicle, so what tram is actually running on that line right now, so you can track, and I

14:44.720 --> 14:51.280
have done that when Prague got a new tram I put in the new tram in the search bar down below

14:51.280 --> 14:59.680
and looked where I could catch the new tram and then went there. So railway data is sort of its own thing,

14:59.680 --> 15:05.360
most things in the Czech system work for railway separately and then for everything else separately,

15:05.360 --> 15:11.360
even though trolleybuses are actually railway vehicles in the Czech legal system,

15:11.360 --> 15:22.240
but Správa železnic, the railway manager, is, I'd even say, anti-open data, because they release

15:22.240 --> 15:29.680
basically nothing and they are very hostile to any freedom of information requests

15:30.960 --> 15:37.840
but they have very cool stuff like historical delays and reasons for delays which aren't published

15:37.840 --> 15:44.560
in any way, so you can't even scrape them, which is a real shame, especially when the compensation

15:45.360 --> 15:52.480
for the manager of this organization was at some point based on cumulative delays in

15:52.480 --> 16:00.000
the network. So they have their cool stuff, they have data for freight trains, which might be interesting, which

16:00.000 --> 16:06.160
I'd love to have, but they just don't want to give it away. Now we're getting to the formats.

16:06.240 --> 16:13.760
The native format for the Czech non-railway timetables is JDF, Jednotný

16:13.760 --> 16:22.400
datový formát. It's basically just, you take the printed standardized form of a timetable and put it

16:22.400 --> 16:32.240
into CSV, and it works pretty well. It's kind of clunky but it works. Then we have GTFS. GTFS sounds great on

16:32.240 --> 16:37.760
the surface, but it doesn't represent, it can't represent everything that we need in our Czech system.

16:37.760 --> 16:44.240
like it can't represent the mess of tariff zones that we have like you really can't have

16:44.240 --> 16:52.400
one trip that goes between three regional systems and supports two additional sort of

16:52.400 --> 16:58.880
ticket systems on top of that and GTFS is just not made for it it's made for one system in one region

16:58.960 --> 17:04.800
managed by one agency. And forbidden travel within a group of stops, it's a real problem,

17:06.240 --> 17:13.040
in for example when you have long distance buses they can't travel within one city or even within

17:13.040 --> 17:19.680
one country sometimes and GTFS just can't encode this restriction so you get bad results

17:21.280 --> 17:28.480
in trip plans. And NeTEx is awesome but really overcomplicated, it's a standard for making standards,

17:29.200 --> 17:34.480
and then, fine, you say, fine, I'll look at EPIP, but EPIP doesn't cover all the things we need

17:34.480 --> 17:41.920
in the Czech Republic, so we're back at square one. So barring any sort of international cooperation,

17:41.920 --> 17:49.440
and NeTEx is, I wouldn't say it will not be the answer, but it's just not ready right now, it's not

17:49.520 --> 18:03.360
something I as a programmer can pick up and get to work on now we're getting to the

18:03.360 --> 18:10.160
cheery solutions page. Many people have tried and many people have failed to

18:10.160 --> 18:18.560
make Czech-wide trip planners, mostly. That's the holy grail: a free and open source multimodal

18:18.560 --> 18:25.360
trip planner. So there's basically three surviving projects right now that have been running for a few

18:25.360 --> 18:33.200
years and are promising and still maintained. My own JrUtil, which produces mostly correct

18:33.200 --> 18:40.640
data because it produces GTFS data and JDF, but nobody uses that, of course. There's the Transitous

18:40.640 --> 18:47.440
project, it's very cool and international and I like it, and some people have made a Czech-specific

18:47.840 --> 18:53.120
data source for it but it has the same GTFS issues and then there's spoying coverage it's

18:53.120 --> 19:00.640
very full stack, it's a bachelor's thesis of a colleague of mine, and it does everything. It's

19:00.640 --> 19:07.920
sort of hamstrung by missing data and missing API sometimes because what it tries to do is

19:07.920 --> 19:16.080
tries to do price optimization it's very cool but probably not something that you could extend

19:16.080 --> 19:22.880
beyond the Czech Republic without a lot of work. And then as a bonus there's my own article

19:22.880 --> 19:29.680
I'll give you duo which is a so we're not just doing timetables it's a scraper and view of

19:29.680 --> 19:38.080
delay data so you can look at the delays on the railway network for a given given train over

19:38.080 --> 19:43.280
maybe the last month and then you can see do I have a chance to even make that connection very useful

19:43.440 --> 19:53.200
this is what it looks like so if you're interested in this there are a few links you can follow

19:53.200 --> 20:01.200
up on. All of them are in Czech, as expected, sorry, but translators are good these days. I'm not going

20:01.200 --> 20:07.840
to tell you you need to learn Czech. And my wishes for the future: public institutions, please

20:08.400 --> 20:15.760
stop killing my attempts at getting data from you it would cost much less money to just

20:15.760 --> 20:22.880
give me the thing instead of paying tens of thousands of crowns to lawyers. I never paid any

20:22.880 --> 20:31.920
crowns to lawyers and I won, so screw you. To Czech transport enthusiasts: please, I know it's hard,

20:31.920 --> 20:36.160
but try to cooperate with other people instead of starting your own project this is not something

20:36.240 --> 20:43.440
you can solve solo I know I tried and for all of you international people trying to make pan

20:43.440 --> 20:52.000
European projects: sorry, bare GTFS is not going to cut it. Please, let's make some extensions

20:52.000 --> 20:57.440
or a new format thank you

