WEBVTT

00:00.000 --> 00:11.760
Hi everyone, my name is Alex and I'm a software engineer at the Freedom of the Press Foundation.

00:11.760 --> 00:20.600
Today I'm going to discuss the subject of removing traces from documents. Real quick,

00:20.600 --> 00:27.820
what we're going to see today: I want to discuss mostly the known

00:27.820 --> 00:34.620
deanonymization vectors we are aware of that may be in documents. We will show you

00:34.620 --> 00:42.380
a survey of tools that can help you anonymize those documents and remove traces from them.

00:42.380 --> 00:46.860
And of course we're going to see limitations, because the subject of removing traces

00:46.860 --> 00:55.420
of your activity from documents is actually really hard. (If I can speak up? I'll try.)

00:55.420 --> 01:01.660
And then finally, given the limitations, we will try to see how we can jump over them,

01:01.660 --> 01:10.380
how we can cross this gap between the technical and the practical in order to overcome,

01:10.380 --> 01:18.060
let's say, these limitations and help people who want to stay anonymous with some practical

01:18.060 --> 01:27.420
advice. Unfortunately, I'm not going to provide some sort of novel way of removing sophisticated

01:27.420 --> 01:34.060
metadata and traitor-tracing schemes that may exist in documents. It's also not going to be

01:34.060 --> 01:41.980
a deep course in metadata removal. It's mostly going to be what's out there, what

01:41.980 --> 01:52.300
tools exist to help there, and some extra advice. So let's jump right into the deanonymization

01:52.300 --> 01:59.900
vectors. What I'm going to do today is try to categorize some of the

01:59.900 --> 02:05.100
vectors that are in the literature. Hopefully it will be a good categorization and an extensive

02:05.100 --> 02:15.020
one. We will focus in this talk solely on what is inside the document, because the subject of

02:15.020 --> 02:22.460
anonymously obtaining a sensitive document and sharing it with someone else is a whole different subject

02:22.460 --> 02:29.820
that leaves traces outside of the document, and we're not going to talk about that. Quick terminology,

02:30.140 --> 02:35.420
in case you are not aware of it: we're going to talk about OCR at some point. OCR is optical

02:35.420 --> 02:40.700
character recognition, and it's the process of retrieving text from images. And we're going to talk

02:40.700 --> 02:48.460
about flattening, which is taking a document, or let's say a page of a document, that has complex components

02:48.460 --> 02:56.940
and making it like a flat image. We're going to talk about metadata, but metadata are just the tip of the

02:57.020 --> 03:04.780
iceberg. There are many more deanonymization vectors, and if during this

03:04.780 --> 03:09.660
presentation you find an example that doesn't fall under one of the categories I will provide,

03:10.300 --> 03:16.860
please speak up, I would like to know about it. I'm going to start with a very simple example, a very

03:16.860 --> 03:26.540
well-known one. A Vice reporter at some point took a picture of McAfee, who was hiding somewhere in

03:26.540 --> 03:34.700
South America, and forgot to scrub the location metadata when he uploaded it, and that way

03:34.700 --> 03:41.180
McAfee was traced. So lesson number one: OK, sure, scrub the metadata from simple documents

03:41.340 --> 03:48.140
like photos and audio. Now, scrubbing the metadata is not that simple when you encounter

03:48.140 --> 03:54.460
complex document types. In this particular case, Google at some point tried to anonymously

03:55.260 --> 04:02.300
do some sort of litigation against eBay in Australia, and the Australian

04:02.300 --> 04:08.140
Department of Justice, or whatever, wanted to keep the identity anonymous, but they forgot that

04:08.700 --> 04:14.300
the title of the Word document that they received was also within the PDF as some sort of

04:14.940 --> 04:23.100
metadata, and this was probably hard to find and figure out. So a bit of a foolproof solution there

04:23.100 --> 04:28.380
is to just flatten the document in order to remove those kinds of embedded metadata.

04:30.540 --> 04:37.340
And when we talk about removing traces in some way,

04:37.420 --> 04:43.020
a common way to do it is with redactions, and redactions are, surprisingly or not surprisingly,

04:43.020 --> 04:50.860
very easy to get wrong. This is a famous case where Sony did a filing and they wanted to

04:50.860 --> 04:55.820
redact how many millions they spend on their games etc., and they used a black Sharpie to do that.

04:55.820 --> 05:01.900
But whenever you try to do redactions with something that is not opaque, and that

05:01.900 --> 05:05.980
of course applies to the Sharpie, but it also applies to digital tools you're using,

05:06.300 --> 05:12.620
your brush in your photo editor, some sort of spray or a mosaic effect, with all those things,

05:12.620 --> 05:18.300
if they're not opaque, the text underneath is bound to be forensically retrieved. And

05:18.300 --> 05:25.340
that's just one of the types of bad redaction. There is also the very common bad redaction

05:25.340 --> 05:31.980
where you just take a PDF, you add an opaque black bar on top of it, and you forget that it's just

05:31.980 --> 05:38.140
a layer on top of the text, and therefore that text can be retrieved again. So the mitigation here

05:38.140 --> 05:43.980
is to make sure you always use, of course, opaque black bars to remove sensitive information,

05:43.980 --> 05:50.620
and make sure to flatten it afterwards. So as you can see, I keep using the word: flattening is an

05:50.620 --> 06:00.140
important defense against those types of things. Now, there is another way to leave traces in a

06:00.140 --> 06:07.100
document, and that's with physical watermarks. In this particular case I'm showing the watermarks

06:07.100 --> 06:14.620
that were in Reality Winner's documents. Reality Winner was a US whistleblower

06:15.820 --> 06:23.500
who gave some documents to The Intercept. A journalist from The Intercept

06:23.500 --> 06:31.740
showed them to the FBI, and then Reality Winner was caught. Some people theorize that Reality

06:31.740 --> 06:37.900
Winner was deanonymized due to those printer tracking codes that you see in blue there. In practice,

06:37.900 --> 06:42.300
though, and this is where the opsec failure comes in, Reality Winner was one of the

06:42.300 --> 06:47.820
six people who had access to this particular document and the only person who had contacted the

06:48.780 --> 06:57.500
Intercept journalist through her work device. So even if we manage to remove those types of traces,

06:57.500 --> 07:03.260
and my suggestion is to use OCR in order to just get the text and put it into another document,

07:03.260 --> 07:09.020
you can never trust printed media. Even if you do that, though, there are bound to be

07:10.860 --> 07:16.220
some traces outside of the document itself, but unfortunately we will not talk about that.

07:18.140 --> 07:26.220
And now there's another category that is also pretty well known. It's the subject of digital

07:26.220 --> 07:33.020
watermarks. Besides physical watermarking, there are certain ways (and on the

07:33.020 --> 07:43.180
left you see what a company advertises to certain customers) in which you can embed, in a

07:43.180 --> 07:49.900
steganographic manner, certain tracking codes in pictures and in various types of documents.

07:51.340 --> 08:00.140
For images it's pixels. For common text it can be zero-width Unicode characters, or

08:01.100 --> 08:08.860
small differences in the kerning of the font. It can be lots of very discreet stuff.

08:09.580 --> 08:17.420
And besides that very cool, paper-worthy stuff on the left, there are also

08:18.220 --> 08:26.540
the plain old techniques that have been used since the Cold War or whatever, where you just

08:26.540 --> 08:33.420
craft a document, you create different permutations of it, maybe with different facts,

08:33.420 --> 08:38.380
and you give them to various people. And when something is leaked, you can trace it back to the

08:38.380 --> 08:46.060
person who leaked it. Now, we can do some things to remove the left category:

08:47.260 --> 08:54.700
for instance, in order to make sure that you don't have to deal with zero-width characters and

08:55.500 --> 09:00.380
small changes in the document, you can OCR the document and do a double translation,

09:00.460 --> 09:05.740
that is, translate your document to another language and then back to your own, because some people use

09:05.740 --> 09:11.100
misspellings in order to do that, or a different type of phrasing. But you will never be able to

09:12.060 --> 09:18.940
remove just plain old fabricated facts that may be in the document. So we're starting to see

09:18.940 --> 09:25.100
some categories where there is a clear limitation, where tools can never give protection.

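NOTE
Editor's sketch, not part of the talk: one way the zero-width character check discussed in this section could look in Python. It assumes the text has already been extracted from the document, and the names INVISIBLE, is_invisible, find_invisible and strip_invisible are hypothetical. Dropping every Unicode "Cf" (format) character is a blunt assumption: it also removes joiners that some scripts and emoji sequences legitimately need, and it does nothing against kerning changes, misspellings, rephrasing or fabricated facts.
```python
import unicodedata
# Common zero-width code points; all of them are also in Unicode category "Cf" (format).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
def is_invisible(ch: str) -> bool:
    """True for the listed zero-width characters and any other format character."""
    return ch in INVISIBLE or unicodedata.category(ch) == "Cf"
def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for every invisible character in text."""
    return [(i, unicodedata.name(ch, hex(ord(ch)))) for i, ch in enumerate(text) if is_invisible(ch)]
def strip_invisible(text: str) -> str:
    """Return text with all invisible characters removed."""
    return "".join(ch for ch in text if not is_invisible(ch))
```
A possible workflow is to call find_invisible first to see whether a copy was marked, then share only the output of strip_invisible.
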
09:25.340 --> 09:34.700
There is also another interesting category, a more active type of tracing, which is canary

09:34.700 --> 09:45.340
tokens. Those are some sort of code that is embedded in the document, and this code will be

09:45.340 --> 09:53.100
able to, let's say, phone home whenever the document is opened. Now, most sane viewers will silently

09:53.100 --> 10:01.900
ignore those types of code in PDFs etc. But Microsoft Office will happily ask you to enable

10:01.900 --> 10:08.780
macros, and Adobe Acrobat, as you see in the left screen, will ask you if you want to connect

10:08.780 --> 10:16.460
to some random URL. In this case it's canarytokens.com, but if it was a legit-looking URL, there's a

10:16.460 --> 10:24.780
high probability that some person would just click on it, and their IP, operating system etc.

10:24.780 --> 10:29.180
will be revealed, and the fact that they accessed the document will be revealed to the source of

10:29.180 --> 10:36.140
the document. So a good defense here is to flatten the document immediately once you obtain it.

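NOTE
Editor's sketch, not part of the talk: a crude Python check for PDF keys that can make a viewer contact a URL or run code when the file is opened. ACTIVE_KEYS and find_active_keys are hypothetical names, and the key list is an assumption rather than a complete inventory. The scan only sees raw, uncompressed bytes, so objects stored inside compressed streams are invisible to it: an empty result proves nothing, and flattening remains the defense described in the talk.
```python
import re
# PDF name keys associated with links, scripts and automatic actions.
ACTIVE_KEYS = [b"/URI", b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch", b"/SubmitForm", b"/GoToR"]
def find_active_keys(pdf_bytes: bytes) -> dict[str, int]:
    """Count how often each key appears in the raw bytes of a PDF."""
    counts = {}
    for key in ACTIVE_KEYS:
        # The lookahead stops /JS from matching longer names such as /JSON.
        pattern = re.escape(key) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[key.decode("ascii")] = hits
    return counts
```
One way to use it is to read the file in binary mode without opening it in a viewer, pass the bytes to find_active_keys, and treat any hit as a reason to flatten before looking at the document.
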
10:39.020 --> 10:44.620
And then there is the subject of fingerprinting, which is different from watermarking.

10:44.620 --> 10:51.900
This category has to do with not just the document itself but also the equipment you used in order to

10:51.900 --> 10:59.820
create that document. It turns out that your phone camera, your microphone, have some unique

10:59.820 --> 11:08.780
characteristics that make it easier to associate different pictures, let's say, that you

11:08.780 --> 11:16.220
take. We call it fingerprinting because, unlike watermarking, where you just need one thing in order

11:16.220 --> 11:22.060
to trace it back to you, with fingerprinting you need a match: if you leave a fingerprint

11:22.060 --> 11:28.140
at the scene, you also need somehow to match it with the person. So in that particular case,

11:28.140 --> 11:34.700
if you take, for instance, a photo of some sort of document and then with the same phone you upload

11:34.780 --> 11:40.380
your family photo, in theory there's the same characteristic in there that can trace it back to you.

11:40.380 --> 11:49.340
So the advice, and we are again moving somewhere where tools are limited, is to use disposable

11:49.340 --> 11:55.820
equipment. And because your writing style can be fingerprinted as well, and some people

11:55.820 --> 12:04.220
have been found this way, it's best to use common words or do this double translation

12:04.220 --> 12:09.500
trick. And again, in this particular case there's no way to be one hundred percent sure that

12:10.140 --> 12:18.460
the document you create will not have something of you in it. And finally, besides the equipment, besides

12:19.100 --> 12:27.180
the thing you're capturing, there's also something that can seep into the picture, the video, the

12:27.180 --> 12:33.420
audio: the environment factor. In this particular case you're seeing the iris of a Japanese

12:33.420 --> 12:40.780
pop idol who was taking some selfies and posting them on social media, and an attacker, a stalker,

12:40.860 --> 12:48.860
managed to zoom into her iris, find out her surroundings and trace her route to a particular

12:48.860 --> 12:59.740
train station, and so he made an attack in this particular case. So in order to be safer in this regard,

13:01.180 --> 13:09.500
the suggestion is to avoid any type of selfie that may reflect the environment, or to avoid

13:09.500 --> 13:16.380
locations, especially in the case of audio where you cannot just do soundproofing, that

13:16.380 --> 13:23.820
can be associated with you. But again, there's no way to be one hundred percent safe.

13:26.220 --> 13:33.100
Now, we saw some categories. If you have anything that doesn't fit in those categories,

13:33.180 --> 13:39.020
do mention it; I'd be happy to learn about it after the talk or as a question. Now we're going to show

13:39.020 --> 13:45.100
some tooling. ExifTool, probably most of you know it: it's a tool that can view

13:46.380 --> 13:53.100
metadata for many file types. It can also remove metadata, and it can work with lots of files

13:53.100 --> 14:00.940
besides images, where it's the de facto tool. Here's how it does it with PDFs:

14:01.900 --> 14:08.140
we can see that we can remove the metadata, and then the end result is something that doesn't

14:08.140 --> 14:14.300
have any metadata in it. But there is a warning, although the tool succeeded and although it

14:14.300 --> 14:21.740
seems that there are no metadata, and that warning is that ExifTool actually does not remove any metadata in

14:21.740 --> 14:28.780
some file types. If you look here, the PDF on the left is the redacted one.

14:28.780 --> 14:33.580
In this particular case ExifTool just added its own section, where it basically says there are

14:33.580 --> 14:41.420
no metadata in this file, and this is not good. ExifTool, to their credit, of course it's a

14:41.420 --> 14:46.620
very battle-tested tool, so of course the limitations are written on the website, but if you are doing

14:46.620 --> 14:52.620
something automatically, or if you're not super aware of every limitation that is in there,

14:53.180 --> 15:00.540
you may just miss it. Now, for something more professional there is mat2. It's an

15:00.540 --> 15:06.700
open source library, it supports lots of documents, and it has a very thoughtful threat model about the

15:06.700 --> 15:12.780
things that it can anonymize and what not. There is a GUI option as well, but unfortunately it's

15:12.780 --> 15:21.260
not maintained right now. What it does is that it removes the metadata from some

15:21.260 --> 15:29.260
file types, but it needs to flatten other file types in order to be 100% safe. The end result is that you

15:29.260 --> 15:35.660
get back the exact same file type that you provided, in most cases, where it's not flattened, so that's

15:35.660 --> 15:43.580
very useful if you want to work with the end result. And this is how mat2 does it: you can see

15:43.580 --> 15:51.020
it detects the metadata, it can remove them, and then it shows that there's nothing

15:51.020 --> 15:57.420
else left. However, because, as I said, it has a very thoughtful threat model, the author of

15:57.420 --> 16:07.340
mat2 claims that it's not 100% safe to remove complex metadata and then keep the original

16:07.340 --> 16:12.860
file, because there may be some things that you don't anticipate in the structure of some files, or there may

16:12.860 --> 16:18.460
be some update and then you miss some other type of metadata that's in there. So you need to exercise

16:18.460 --> 16:25.500
some caution with tools that are not flattening the documents. And now the tool that this presentation

16:25.500 --> 16:32.300
was mostly about: it's Dangerzone. It supports lots of documents, but unfortunately it does not

16:32.300 --> 16:43.100
support audio and video. The main idea is to sanitize malware, but it also ends up removing the metadata,

16:43.100 --> 16:51.580
because the way it works is that you pass the document into a container, it breaks it

16:51.580 --> 16:57.260
down into pixels, and then outside the container it reconstructs the original document as a PDF from

16:57.260 --> 17:04.140
these pixels. So through this process there is no malware, but also there are no metadata from the original

17:04.140 --> 17:10.140
document. The end result, though, is that it's completely flattened, so you cannot easily,

17:10.140 --> 17:17.500
you cannot edit it or anything. However, you can OCR it. It's maintained by Freedom of the Press Foundation

17:17.500 --> 17:24.460
and, disclaimer, I work on this project. That's the CLI option for Dangerzone. As you can see,

17:24.460 --> 17:31.420
you can pass the document, and you can also pass the OCR language in order to create a layer

17:31.420 --> 17:39.580
over the text. Now, quick summary: we have seen that metadata removal makes files

17:39.580 --> 17:47.020
shareable; however, it doesn't remove every deanonymization vector. We have seen two main tools for

17:47.020 --> 17:55.100
removing traces from your data, Dangerzone and mat2. mat2 produces editable files

17:55.100 --> 18:01.340
and it's definitely faster. Dangerzone tries to be safe and flattens everything, but does not

18:01.340 --> 18:08.300
provide something that's easy to work with. And I have applied the tools to the categories I

18:08.300 --> 18:13.820
mentioned before, and we can see that there are clear limitations, even in the case of

18:13.820 --> 18:22.540
Dangerzone, where the user cannot be fully protected. But in order for them to be protected,

18:24.380 --> 18:31.660
we need to help them a bit more. The Xs that you saw before are the gap, the gap

18:31.660 --> 18:37.660
where the tooling cannot actually help you and you need to do something extra yourself. So

18:37.740 --> 18:46.700
tools that flatten documents are a good first step, but even with them you cannot

18:46.700 --> 18:53.260
remove everything. However, we know that whistleblowers and activists will still try

18:54.380 --> 19:00.780
to share documents that are sensitive and try to improve society this way, and we need to

19:00.780 --> 19:04.700
help them in this situation. We cannot just say that those are the limitations and you cannot

19:04.780 --> 19:11.900
do anything else. So what we need to offer in this particular case is something that goes beyond tooling,

19:12.460 --> 19:18.780
something that goes beyond the technical stuff, and it's basically plain old harm reduction.

19:20.460 --> 19:27.820
It's basically advice that you can give to those people that goes beyond tooling and

19:27.820 --> 19:35.660
can help them avoid some common issues that we know from the literature. There are bound to be

19:35.660 --> 19:44.060
risks in what they're doing, and at some point they're bound to get caught by some

19:44.060 --> 19:49.340
law enforcement, because there are lots of mistakes that you can make along the way, but at least

19:49.340 --> 19:54.300
we should be able to educate them so that they can make an informed choice about what they're doing.

19:54.540 --> 20:03.900
And I'm going to go very quickly through some types of documents and some examples, some

20:03.900 --> 20:11.900
clear advice that we can give to whistleblowers and activists. So for video: we do not support it,

20:11.900 --> 20:19.660
Dangerzone does not support it, so prefer using mat2. If it's a recording, if you record it yourself,

20:19.740 --> 20:25.020
please use disposable equipment and locations that are away from you. If it's accessible only to you,

20:25.020 --> 20:29.740
please be aware of digital watermarking and prefer not to go public with that document. And as always,

20:29.740 --> 20:37.260
remember it's not one hundred percent safe to share. For images: again, disposable equipment. If it's

20:37.980 --> 20:44.060
accessible only to you, it may be digitally watermarked, so prefer not to go public. If it's a

20:44.060 --> 20:50.940
print or a scan, try to use the original files, else again prefer not to go public. And with redactions,

20:50.940 --> 20:58.060
make sure you add them with black opaque bars before you do anything with any tool. And finally, for

20:58.060 --> 21:04.940
documents: because documents convey less information than pictures, we have a bit more leeway there.

21:04.940 --> 21:12.460
If it's printed material, make sure to OCR it or retype it manually. For redactions, black bars again.

21:12.540 --> 21:18.940
If it's something that's accessible only to you, again, it may be watermarked, and may be watermarked

21:18.940 --> 21:24.620
in ways where the facts may be completely fabricated and can be traced back to you, so that's an issue.

21:25.260 --> 21:31.020
However, this does not mean you don't need to do anything. At least try to OCR it to

21:31.020 --> 21:36.940
remove any zero-width characters etc., and maybe do some double translation.

21:37.820 --> 21:45.180
And for writing documents, just remember to use common words, do the double translation

21:45.180 --> 21:52.380
trick, and as always, you may be fingerprinted, and in that case prefer not going public if it's

21:52.380 --> 21:57.180
your document. And again, and this is something everyone needs to remember, there's no one

21:57.180 --> 22:04.140
hundred percent safe way to do this risky thing, but you need to be educated. So that's it for me.

22:04.460 --> 22:10.620
Thanks a lot for watching this talk, and here are some links where you can learn more about the

22:10.620 --> 22:17.500
projects, and I'm happy to take any questions.

22:23.740 --> 22:30.140
Your comments about printing: was that referring to, like, if you print to a PDF, so print to a digital

22:30.140 --> 22:37.420
format? And what is actually written out, is it just some sort of unique identifier

22:37.420 --> 22:42.860
or is there more information in there? So the question was what do we mean by printed material.

22:42.860 --> 22:48.220
I'm referring to printing on physical material, as in a typical printer, and the thing that

22:48.220 --> 22:54.380
comes out: originally it was some tracking dots; in modern times you cannot even see it, it may be

22:55.100 --> 23:03.500
much more difficult. If you print to a PDF, is that not flattening it? If you print, I mean,

23:04.940 --> 23:14.300
say you just choose print to PDF. Yeah, so the question is if you print a PDF to physical material

23:14.300 --> 23:23.980
or if you do the print-to-PDF thing. Yeah, if it's creating a flat version. If it's creating a

23:23.980 --> 23:30.380
flat version of a web page, for instance? No, in that particular case it's not a flat version, because

23:30.380 --> 23:35.980
PDFs can contain complex file types in there as well, and they may still remain underneath.

23:35.980 --> 23:51.340
so it's not that much of a [unclear]. Would you be interested in using a project like

23:51.340 --> 23:59.020
Docling, where you can do OCR all the time, so you just get the actual text?

23:59.020 --> 24:05.100
Can you repeat that project name, please? Docling. So you're asking if it's

24:29.660 --> 24:36.780
the structure so it would be very easy because that doesn't happen and you make the data

24:38.140 --> 24:45.100
OK, that's fine. Is it something that runs locally? In that case, OK, then I guess it would be an

24:45.100 --> 24:53.100
interesting option as well. Please.

25:09.100 --> 25:13.900
Yeah, so there are some more complex ways to deal with that in order to get back

25:15.900 --> 25:23.100
Oh yeah, sure. So the question was if it makes sense, if you have a PDF for instance,

25:23.100 --> 25:28.940
to convert it to PostScript and then convert it back to PDF, in that way removing the metadata.

25:32.140 --> 25:36.940
Sure, maybe that's a way to do that. I'm not sure if it removes everything. I don't know,

25:36.940 --> 25:42.700
for instance, what happens with images that are in the PDF, if the metadata of those

25:42.700 --> 25:47.900
images are removed as well, or what happens with audio recordings or videos that may be in some

25:47.900 --> 25:53.980
PDFs. You can embed lots of things there. But the point, and I didn't stress it that much,

25:54.540 --> 26:00.380
the point is that these tools should be used by people who are not technical. They should be

26:00.380 --> 26:06.700
used by people who use Windows or macOS. In our case it's just a GUI application, and it needs to work

26:06.700 --> 26:13.340
with office documents, so we do the exact same thing everywhere and we do not take any

26:13.340 --> 26:18.460
chances on what may leak or what may not.

