WEBVTT

00:30.000 --> 00:37.000
We get a chance to talk about this urgent topic of small data.

00:37.000 --> 00:46.000
I believe that we all are, or at some point have been, hoarding data in our lives.

00:46.000 --> 00:54.000
At some point, when I first started to work with data a little bit more seriously, about 10 or 15 years ago,

00:54.000 --> 00:58.000
at that time I was a product manager for a big media site.

00:58.000 --> 01:01.000
We were a bit naive.

01:01.000 --> 01:07.000
We were trying to gather as much data as possible about our readers,

01:07.000 --> 01:10.000
because we wanted to improve the reading experience.

01:10.000 --> 01:13.000
We wanted to improve the reading time and the business.

01:13.000 --> 01:19.000
I think we didn't do too much harm at that time because the systems were also very immature.

01:19.000 --> 01:25.000
But looking back, I do wonder now: why did no one ask the question,

01:25.000 --> 01:31.000
like, can you find these insights in other ways, other options?

01:31.000 --> 01:35.000
And it's getting worse, it's getting worse.

01:35.000 --> 01:42.000
So if we work today with health data, for instance,

01:43.000 --> 01:48.000
there are also people I know who are really, really engaged in that,

01:48.000 --> 01:54.000
who are hoarding data to make better medicines, to make more precise medicines,

01:54.000 --> 01:56.000
which is a really good cause.

01:56.000 --> 02:01.000
I have several people close to me who suffer from severe diseases,

02:01.000 --> 02:06.000
and who could really use some good, more precise medicines,

02:06.000 --> 02:10.000
but no one really asks that question today either.

02:10.000 --> 02:14.000
So that's what we're going to talk about today.

02:14.000 --> 02:18.000
We're working on the project DNS TAPIR,

02:18.000 --> 02:25.000
where one of the purposes is to find cyber threats in the data,

02:25.000 --> 02:29.000
in this query data, and in cybercrime,

02:29.000 --> 02:35.000
the hoarding of data is really getting troublesome.

02:35.000 --> 02:42.000
So my name is Ulika Vincent, and I've been working with this project,

02:42.000 --> 02:47.000
DNS TAPIR, for about two years, and I recently picked up coding again,

02:47.000 --> 02:52.000
a couple of years ago, which was a very good idea; my life got better.

02:52.000 --> 02:56.000
And I'm working together with Michael Kulberg,

02:56.000 --> 03:00.000
who is one of the founders of the project, and also data architect,

03:00.000 --> 03:03.000
and a lot of other things, too.

03:04.000 --> 03:11.000
So first I'm going to just give you a short overview of what our project is about,

03:11.000 --> 03:16.000
and then Michael will get into a little bit more detail.

03:16.000 --> 03:25.000
So, first, before introducing the DNS TAPIR project,

03:25.000 --> 03:28.000
I want to introduce our way of working.

03:29.000 --> 03:32.000
Big data has become a very common way:

03:32.000 --> 03:37.000
it's seen as the salvation of all knowledge in the world.

03:37.000 --> 03:40.000
You gather as much data as possible,

03:40.000 --> 03:45.000
you try to comply with laws by ticking checkboxes,

03:45.000 --> 03:50.000
and you try to protect the sensitive data with all kinds of shields,

03:50.000 --> 03:55.000
and try to make customers and others trust you.

03:55.000 --> 03:59.000
But what we are trying to do, or what we are doing, actually,

03:59.000 --> 04:05.000
not just trying, is that we work with very sensitive data,

04:05.000 --> 04:09.000
but we want to collect the minimum needed to get the insights.

04:09.000 --> 04:14.000
Oh, sorry, I can't walk around.

04:14.000 --> 04:16.000
Sorry.

04:16.000 --> 04:21.000
And we try every day to find ways to throw away data,

04:21.000 --> 04:24.000
as soon as possible.

04:24.000 --> 04:27.000
We want to distribute the storage of the data,

04:27.000 --> 04:31.000
and instead of just filling in checkboxes,

04:31.000 --> 04:35.000
we want to be compliant by design,

04:35.000 --> 04:41.000
and we do protection by, sort of, differential privacy.

04:41.000 --> 04:44.000
So, what is DNS TAPIR?

04:44.000 --> 04:47.000
DNS TAPIR is a privacy-first,

04:47.000 --> 04:52.000
open-source platform, with local installations, for analytics

04:52.000 --> 04:57.000
on DNS query data.

04:57.000 --> 05:05.000
And TAPIR runs next to the recursive resolver;

05:05.000 --> 05:09.000
for those of you who might not know, the queries

05:09.000 --> 05:13.000
that are sent when you or your application

05:13.000 --> 05:16.000
do a lookup on the internet

05:16.000 --> 05:19.000
pass through a recursive resolver.

05:19.000 --> 05:26.000
And we upload events and aggregates to a cloud analytics platform.

05:26.000 --> 05:30.000
And it publishes observations back to the edge,

05:30.000 --> 05:33.000
which can take some action on them.

05:33.000 --> 05:39.000
Just to give you a very quick view of what the data looks like,

05:39.000 --> 05:43.000
if you're not doing this every day.

05:43.000 --> 05:48.000
This is what happens when I load BrusselsTime.org.

05:48.000 --> 05:51.000
This is all the DNS queries sent.

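NOTE
A minimal sketch of the kind of lookups shown here, using the third-party dnspython library (an assumed dependency; any stub resolver would do). Loading a single page fans out into many such queries through the recursive resolver.
  # pip install dnspython
  import dns.resolver
  # Each name is one of the many lookups a single page load can trigger.
  for name in ["brusselstime.org", "www.googletagmanager.com"]:
      answer = dns.resolver.resolve(name, "A")  # goes via the recursive resolver
      print(name, [rr.address for rr in answer])
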
05:51.000 --> 05:56.000
So, here you can find a lot of interesting things, both threats,

05:56.000 --> 05:58.000
such as botnets, etc.,

05:58.000 --> 06:04.000
but also leaks of identifiable information about you.

06:04.000 --> 06:09.000
But we want to look at this data to observe it

06:09.000 --> 06:16.000
and to find strange things and bad actors.

06:16.000 --> 06:20.000
And it's toxic.

06:20.000 --> 06:27.000
Some design principles: we are using aggregation,

06:27.000 --> 06:33.000
where we really separate data sets.

06:33.000 --> 06:43.000
And we try to make, or we are making, individual tracking

06:43.000 --> 06:47.000
impossible by design.

06:47.000 --> 06:50.000
And after our aggregation,

06:50.000 --> 06:56.000
you can't do reverse engineering to find individuals.

06:56.000 --> 07:00.000
So, we also work with minimization.

07:00.000 --> 07:08.000
Other solutions often do minimization after extraction,

07:08.000 --> 07:11.000
in the ETL process.

07:11.000 --> 07:16.000
But we do transformation and minimization at the source.

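NOTE
A minimal sketch of minimization at the source, with a hypothetical event schema; identifying fields are dropped or coarsened before anything leaves the resolver.
  from datetime import datetime, timezone
  def minimize(event: dict) -> dict:
      # Keep only what the analytics needs; drop client IP and query ID entirely.
      ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
      return {
          "qname": event["qname"].lower(),         # normalize case noise
          "hour": ts.strftime("%Y-%m-%dT%H:00Z"),  # coarsen timestamp to the hour
      }
  print(minimize({"ts": 1700000000, "qname": "WWW.Example.COM", "client_ip": "192.0.2.7"}))
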
07:16.000 --> 07:27.000
But the main idea is that when we do aggregation,

07:27.000 --> 07:34.000
instead of looking at individual traces,

07:34.000 --> 07:39.000
then we simply won't have any data that can be misused.

07:39.000 --> 07:41.000
I tried to teach my children this.

07:41.000 --> 07:46.000
Like, data you share will probably be leaked one day.

07:46.000 --> 07:51.000
And we as developers or analysts or product people

07:51.000 --> 07:57.000
should be aware of this.

07:57.000 --> 08:01.000
We're also, kind of, working with differential privacy,

08:01.000 --> 08:05.000
which is that the results from our observations

08:05.000 --> 08:09.000
won't differ in a significant way.

08:09.000 --> 08:13.000
whether your individual browsing behavior is in the data

08:13.000 --> 08:16.000
or not, to simplify it.

08:16.000 --> 08:19.000
And that gives deniability.

08:19.000 --> 08:25.000
So you can always, always state that it's not possible

08:25.000 --> 08:29.000
to find you in the data.

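NOTE
A toy illustration of the differential-privacy idea, not the project's actual mechanism: Laplace noise is added to a published count, so the output barely depends on whether any one user is in the data.
  import numpy as np
  def noisy_count(true_count: int, epsilon: float = 1.0) -> float:
      # A count has sensitivity 1 (one user changes it by at most 1),
      # so Laplace noise with scale 1/epsilon gives epsilon-DP.
      return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
  print(noisy_count(1042))  # with you in the data
  print(noisy_count(1041))  # without you: the two outputs are statistically close
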
08:29.000 --> 08:34.000
Another thing we're doing is that we want to publish

08:34.000 --> 08:40.000
these observations to the public and to partners and third parties.

08:40.000 --> 08:42.000
By doing that from the start,

08:42.000 --> 08:45.000
designing the system to share the data,

08:45.000 --> 08:50.000
then we will make better priorities every day

08:50.000 --> 08:56.000
because otherwise the ISPs who run DNS TAPIR on their resolvers

08:56.000 --> 08:58.000
wouldn't trust us.

08:58.000 --> 09:01.000
They would never share the data

09:01.000 --> 09:04.000
when they know we're going to publish it.

09:04.000 --> 09:08.000
So it has to work in our architecture.

09:08.000 --> 09:10.000
So to summarize,

09:10.000 --> 09:14.000
stop the pathological hoarding, is what we're saying.

09:14.000 --> 09:17.000
Well, over to Michael.

09:17.000 --> 09:19.000
Thank you.

09:19.000 --> 09:20.000
Okay.

09:20.000 --> 09:23.000
So I need to get into the box.

09:23.000 --> 09:28.000
I don't like boxes, but I'll stick to this one.

09:28.000 --> 09:32.000
So I'm just going to go into the technical stuff for a little bit.

09:32.000 --> 09:35.000
Because everyone keeps asking me about the technical bit.

09:35.000 --> 09:38.000
And we've created this analysis platform.

09:38.000 --> 09:41.000
And this is how the sausage is currently made.

09:41.000 --> 09:46.000
And some mostly useless details: we land the data,

09:46.000 --> 09:49.000
and we use Spark and notebooks and stuff to analyze it.

09:49.000 --> 09:51.000
And we have microservices.

09:51.000 --> 09:55.000
And JupyterHub is the interface to meet the analysts.

09:55.000 --> 10:02.000
So this was the quick review of the technical part of analysis.

10:02.000 --> 10:07.000
When it comes to the segmentation part that we've mentioned,

10:07.000 --> 10:09.000
this is an overview.

10:09.000 --> 10:14.000
Basically, you have internal stuff that ISPs don't want us to see.

10:14.000 --> 10:16.000
And those we throw away.

10:16.000 --> 10:20.000
And then we have the ones that we sort of already know about.

10:20.000 --> 10:25.000
Those we gather up and aggregate, and throw stuff away.

10:25.000 --> 10:29.000
And basically say, oh, I really can't put up with Google this week.

10:29.000 --> 10:32.000
So everything under Google goes into a bucket.

10:32.000 --> 10:35.000
And I just know it's in the Google query store.

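NOTE
A sketch of this bucketing step, with a hand-rolled suffix match and an illustrative domain list (a real system would use the public-suffix list); every subdomain of a well-known domain collapses into one counter.
  from collections import Counter
  KNOWN = ("google.com", "googleapis.com", "doubleclick.net")  # illustrative
  buckets = Counter()
  def bucket(qname: str) -> str:
      for suffix in KNOWN:
          if qname == suffix or qname.endswith("." + suffix):
              return suffix  # all subdomains fall into one bucket
      return "__other__"
  for q in ["maps.google.com", "fonts.googleapis.com", "example.org"]:
      buckets[bucket(q)] += 1
  print(buckets)  # individual subdomains are gone; only per-bucket counts remain
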
10:35.000 --> 10:42.000
Then we have unique events where we're interested in unique events.

10:42.000 --> 10:47.000
Because there's a thing in cybersecurity.

10:47.000 --> 10:53.000
Typically, 90% of all new domains are malicious in some way.

10:53.000 --> 11:02.000
So having those domains available is a very good way of predicting what's going to go bad in the short term.

11:02.000 --> 11:05.000
So the first time we see any of these domains, we send them.

11:05.000 --> 11:10.000
But we disconnect them from any other queries or any user information, etc.

11:10.000 --> 11:13.000
So we just basically get a domain that says, oh, this one's new.

11:13.000 --> 11:16.000
Or at least that server thought it was new.

11:16.000 --> 11:19.000
And then we need to figure out if it's actually new.

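NOTE
A minimal sketch of the new-domain events, under assumptions: the edge keeps a set of names it has already seen and emits only the bare domain on first sight, stripped of client context, and the central side checks whether it is genuinely new.
  seen: set[str] = set()  # in practice a persistent or probabilistic set
  def emit(event: dict) -> None:
      print("upload:", event)  # stand-in for the real upload path
  def on_query(qname: str, client_ip: str) -> None:
      domain = qname.lower().rstrip(".")
      if domain not in seen:
          seen.add(domain)
          emit({"event": "new_domain", "domain": domain})  # no client IP, no query ID
  on_query("fresh-site.example.", "192.0.2.7")
  on_query("fresh-site.example.", "198.51.100.9")  # second sighting: nothing sent
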
11:19.000 --> 11:21.000
And then we have the things that are in between.

11:21.000 --> 11:26.000
Now, when you guys are going shopping or you're walking down the street and you're looking at all the places you pass,

11:26.000 --> 11:30.000
you create a pattern of where you are.

11:30.000 --> 11:34.000
And that pattern is in itself an identifier.

11:34.000 --> 11:38.000
So, as you saw with the queries that came from this newspaper,

11:39.000 --> 11:46.000
so one of the newspapers I know about, they have like 380 queries for their front page, basically.

11:46.000 --> 11:52.000
And there's a number of different ad tokens that will identify you in different ways.

11:52.000 --> 11:57.000
And when your computer connects to a network, it will typically go and check. If you're running Windows,

11:57.000 --> 11:59.000
it's going to ask: are there any updates available?

11:59.000 --> 12:05.000
And then you have all your software that also wants to know about updates, all these things create patterns.

12:05.000 --> 12:14.000
And those patterns are both interesting, but also identifying and, well, toxic.

12:14.000 --> 12:20.000
So those end up in the bucket over at the end, where currently we're throwing them away,

12:20.000 --> 12:23.000
because we built all the other stuff.

12:23.000 --> 12:34.000
But we do need a local analysis platform that aggregates this data and removes the patterns and identifying information.

12:34.000 --> 12:39.000
So this is the segmentation part.

12:39.000 --> 12:45.000
And this data, well, this is where it ends up, and it's probably not readable.

12:45.000 --> 12:52.000
But here we have our histograms and it's basically Google and counts.

12:52.000 --> 12:58.000
Well, these are sketches, and this is an HLL sketch.

12:59.000 --> 13:02.000
The sketches could identify you if there are too few users in them.

13:02.000 --> 13:12.000
So we have a hard cutoff for these, where you won't really have a sketch until the number of users passes a specific number.

13:12.000 --> 13:14.000
And our current number is 20.

13:14.000 --> 13:27.000
So at the point where you're one of 21 or so users, there's going to be a sketch here to be able to handle the,

13:27.000 --> 13:33.000
well, I lost the word, but anyway, user count is one of those cardinality problems, right?

13:33.000 --> 13:46.000
So to handle the cardinality of users, each of those sketches is a HyperLogLog sketch, maintaining an approximation of the user count across time.

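NOTE
A sketch of the cutoff-then-sketch idea, using the third-party datasketch library (an assumed dependency; any HyperLogLog implementation works): nothing is published below the 20-user threshold, and above it only the sketch's approximate count survives.
  # pip install datasketch
  from datasketch import HyperLogLog
  THRESHOLD = 20
  exact: set[bytes] = set()  # held privately, never published
  hll = None                 # no sketch exists below the threshold
  def add_user(user_id: bytes) -> None:
      global hll
      if hll is not None:
          hll.update(user_id)
          return
      exact.add(user_id)
      if len(exact) > THRESHOLD:  # enough users: switch to the sketch
          hll = HyperLogLog()
          for u in exact:
              hll.update(u)
          exact.clear()           # forget the exact identifiers
  for i in range(25):
      add_user(f"user-{i}".encode())
  print(hll.count() if hll else "below threshold: nothing published")
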
13:46.000 --> 13:50.000
When it comes to events, these are the events.

13:50.000 --> 13:55.000
Like, well, this one is really interesting.

13:55.000 --> 14:01.000
So, a query like this one to a CDN will come with mixed upper- and lowercase letters.

14:01.000 --> 14:05.000
That's a strategy for adding more bits into DNS.

14:05.000 --> 14:08.000
That's why it's in upper- and lowercase letters.

14:08.000 --> 14:12.000
They just show up once from each server.

14:13.000 --> 14:22.000
And we aggregate them centrally to see if they're actually completely new or if they just show up late somewhere.

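NOTE
A toy version of the mixed-case trick (DNS 0x20 encoding): DNS names are case-insensitive, so each letter can carry one random bit that an off-path spoofer would have to guess.
  import random
  def encode_0x20(qname: str) -> str:
      # Flip each letter's case at random; digits and dots pass through.
      return "".join(c.upper() if random.getrandbits(1) else c.lower() for c in qname)
  q = encode_0x20("cdn.example.com")
  print(q)  # e.g. cDn.exAMple.CoM
  # The resolver checks that the answer echoes exactly the casing it sent:
  print(q.lower() == "cdn.example.com")
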
14:22.000 --> 14:30.000
And the methods we use for doing this, well, it's Apache Spark.

14:30.000 --> 14:38.000
I picked this particular one because an interesting fact that we learned the hard way is that Apache Spark runs on the JVM,

14:38.000 --> 14:43.000
but it actually doesn't know about 64-bit unsigned numbers.

14:43.000 --> 14:47.000
So you actually have to do some really weird stuff too.

14:47.000 --> 14:53.000
If you're using a bit string in your analysis or your data collection platform,

14:53.000 --> 14:59.000
and you're sending that up, using a 64-bit string to indicate things,

14:59.000 --> 15:06.000
there are only going to be 63 of those bits available in Scala.

15:06.000 --> 15:11.000
So basically we just replace them with letters and use that.

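NOTE
A small demonstration of the signed/unsigned mismatch, shown in Python: the JVM's long is signed, so a 64-bit flag word with the top bit set turns negative when Scala reads it, which is why only 63 bits are safely usable.
  import struct
  flags = 1 << 63  # top bit of a 64-bit flag word
  as_jvm_long = struct.unpack("<q", struct.pack("<Q", flags))[0]
  print(flags)        #  9223372036854775808 as an unsigned value
  print(as_jvm_long)  # -9223372036854775808 as a signed JVM long
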
15:11.000 --> 15:15.000
So when it comes to sharing this data,

15:15.000 --> 15:18.000
a data commons is a fabulous idea.

15:18.000 --> 15:21.000
I love it. I really want to share this data.

15:21.000 --> 15:23.000
Sharing data is a bit tricky.

15:23.000 --> 15:25.000
I don't know if you guys remember this,

15:25.000 --> 15:31.000
but Strava exposed interesting boxes on the maps in Afghanistan,

15:31.000 --> 15:34.000
and in the Antarctic.

15:34.000 --> 15:38.000
So that was unfortunate.

15:38.000 --> 15:40.000
And this is because they were sharing data,

15:40.000 --> 15:46.000
and they weren't really thinking that this would in any way be problematic.

15:46.000 --> 15:54.000
Netflix also had this prize where you were supposed to improve their recommendation system.

15:54.000 --> 16:02.000
And it turns out that if you have very specific tastes in which movies you watch,

16:02.000 --> 16:07.000
yes, it is possible to find you in aggregated data.

16:07.000 --> 16:11.000
And this is one of the concerns we have for our data,

16:11.000 --> 16:19.000
because, I mean, the chance that there is someone asking some really odd questions is pretty high.

16:19.000 --> 16:23.000
I know for a fact that I would probably stand out,

16:23.000 --> 16:28.000
but that makes it hard to share the data.

16:29.000 --> 16:37.000
So, as we've been working on this to create data that we believe we can share,

16:37.000 --> 16:41.000
we've had a very basic design strategy.

16:41.000 --> 16:44.000
We design it as well as we can,

16:44.000 --> 16:47.000
and we try to break it, like really break it,

16:47.000 --> 16:52.000
and then we redesign it, and then we go back to trying to break it.

16:52.000 --> 16:57.000
And at some point we'll probably want someone else to try to break it as well,

16:57.000 --> 17:01.000
before we hand over all the data.

17:01.000 --> 17:10.000
There's also a tricky part about this data since it's related to security research, et cetera.

17:10.000 --> 17:17.000
Chances are that any publication of the data needs to be somewhat delayed,

17:17.000 --> 17:26.000
so that all the security people get a chance to sort of make sure that they fixed all the stuff that potentially could fall out from this data.

17:26.000 --> 17:30.000
So, having this as your goal

17:30.000 --> 17:38.000
actually puts some pressure on you to think a number of times about your data before you release it to anyone.

17:38.000 --> 17:46.000
And this, I don't think, is unique to DNS data.

17:46.000 --> 17:51.000
I think this is something that can be applied to a number of different types of data,

17:51.000 --> 17:54.000
coming from networks and computers.

17:54.000 --> 18:02.000
I can think of a number of things there, but it would be interesting to know about other areas where this could be applied.

18:02.000 --> 18:09.000
So, if you have ideas where you can use these strategies in some really wild other field,

18:09.000 --> 18:12.000
I would definitely be interested in knowing.

18:12.000 --> 18:20.000
So, I believe that is the main gist of what we're doing.

18:20.000 --> 18:25.000
So, I think we'll leave a lot of time for questions, I hope.

18:25.000 --> 18:38.000
Yeah, and I can just mention also that we would love for you to try to break our model.

18:38.000 --> 18:46.000
So, if you'd like to contribute in some way, or just follow our work, even if you're not in the DNS world,

18:46.000 --> 18:49.000
it could be interesting to exchange experiences.

18:49.000 --> 18:54.000
Please reach out to us by email, or most of us are on LinkedIn,

18:54.000 --> 18:57.000
or check out the site and the repo, because,

18:57.000 --> 19:03.000
I didn't put it here, but we have the repo on GitHub, so just reach out,

19:03.000 --> 19:09.000
and have a go at trying to break our model.

19:09.000 --> 19:12.000
Yeah, questions.

19:21.000 --> 19:24.000
Then maybe I can ask you a question.

19:25.000 --> 19:33.000
[Audience question, partly inaudible:] Do you do the filtering in the project, deciding what gets through,

19:33.000 --> 19:38.000
or do you just log, and researchers look for what they're looking for?

19:44.000 --> 19:49.000
So, what do we do?

19:49.000 --> 19:51.000
Yeah, I'll repeat the question.

19:51.000 --> 19:58.000
The question was if we are filtering

19:58.000 --> 20:04.000
the domains for the user, and DNS TAPIR doesn't make filtering decisions,

20:04.000 --> 20:11.000
which is a bit different from other solutions that make filtering decisions.

20:11.000 --> 20:17.000
We publish observations to the resolver operator, where, for instance,

20:17.000 --> 20:22.000
an observation can be that it's a new domain, low rank,

20:22.000 --> 20:25.000
a sudden ramp-up, and things like that.

20:25.000 --> 20:29.000
And then there is a module in DNS TAPIR

20:29.000 --> 20:34.000
we call the policy processor, which can take these observations

20:34.000 --> 20:39.000
and decide, in a policy together with other sources

20:39.000 --> 20:45.000
such as lists, etc., what response to set for that domain.

20:45.000 --> 20:48.000
Okay, now I've got three observations from TAPIR.

20:48.000 --> 20:53.000
Plus, this domain has these characteristics.

20:53.000 --> 20:55.000
Okay, let's filter it.

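NOTE
A minimal sketch of a policy processor along these lines, with hypothetical observation names: TAPIR only supplies observations, and a local policy combines them with other sources, such as lists, to decide the response.
  BLOCKLIST = {"known-bad.example"}  # one of the "other sources"
  def decide(domain: str, observations: set[str]) -> str:
      if domain in BLOCKLIST:
          return "NXDOMAIN"  # hard block from a list
      if {"new_domain", "low_rank", "sudden_ramp_up"} <= observations:
          return "NXDOMAIN"  # enough signals together: filter it
      return "RESOLVE"       # otherwise just observe, don't decide
  print(decide("fresh-site.example", {"new_domain", "low_rank", "sudden_ramp_up"}))
  print(decide("fresh-site.example", {"new_domain"}))
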
20:55.000 --> 20:58.000
So, TAPIR just observes.

20:58.000 --> 21:04.000
We want to observe, and not decide whether it's a bad or a good domain.

21:04.000 --> 21:07.000
Yeah.

21:07.000 --> 21:12.000
So, the question I would like you to bring home is,

21:12.000 --> 21:17.000
like, when can I throw away data?

21:18.000 --> 21:32.000
Thank you.

21:32.000 --> 21:33.000
Thank you.

21:47.000 --> 21:52.000
Thank you.

