WEBVTT

00:00.000 --> 00:11.000
All right, can everybody hear me just fine?

00:11.000 --> 00:12.000
Yeah, good.

00:12.000 --> 00:14.000
All right, so hi.

00:14.000 --> 00:15.000
I'm Dimitri.

00:15.000 --> 00:19.000
I'm a PhD student at the CEA in Paris.

00:19.000 --> 00:24.000
And I'm going to present some work, some current work that I've been working on with Michael

00:24.000 --> 00:25.000
who's just there.

00:25.000 --> 00:27.000
And Stefano Zacchiroli.

00:28.000 --> 00:34.000
So as I'm sure you may know, there's been a bunch of recent backdoor attacks targeting the supply chain.

00:34.000 --> 00:37.000
So there are two sort of goals here.

00:37.000 --> 00:41.000
Either you compromise the project directly, and you get access to that project,

00:41.000 --> 00:45.000
or you compromise some core project that has a lot of dependents,

00:45.000 --> 00:49.000
And through that project, you pollute the entire supply chain.

00:49.000 --> 00:55.000
So the attack vectors that we've seen in real life are either people trying to push some malicious commit,

00:55.000 --> 01:01.000
as was the case with PHP, where people just hijacked some git accounts and managed to push commits.

01:01.000 --> 01:09.000
Or you go the tarball route, where you basically replace a legitimate release tarball by an infected one containing a backdoor.

01:09.000 --> 01:13.000
And that's what happened with vsftpd and ProFTPD.

01:13.000 --> 01:16.000
And then you have xz-utils, which is kind of both, right?

01:16.000 --> 01:20.000
And it's even worse, because it was somebody who was supposed to be a trusted maintainer.

01:21.000 --> 01:27.000
So these were all detected up to a few days after injection, which is already not bad,

01:27.000 --> 01:31.000
but it was largely due to luck and just sheer manual effort, right?

01:31.000 --> 01:35.000
So we need to have something better than this to catch these things.

01:35.000 --> 01:38.000
So how do we detect backdoors in general?

01:38.000 --> 01:42.000
Well, the historical way of doing it is basically manual reverse engineering.

01:42.000 --> 01:47.000
So there's not really a simple algorithm you can use to do it.

01:47.000 --> 01:50.000
You kind of just have to dig through a binary.

01:50.000 --> 01:54.000
And there are tools to help with automating parts of that.

01:54.000 --> 01:58.000
But we talked about this yesterday.

01:58.000 --> 02:04.000
Basically a year ago we introduced a new approach that's based on fuzzing that largely automates this.

02:04.000 --> 02:10.000
So basically the way that approach works is you collect a set of representative inputs.

02:10.000 --> 02:12.000
So those are common inputs to a program.

02:12.000 --> 02:16.000
And for every new input you compare it with the representative inputs.

02:16.000 --> 02:17.000
And if the behavior is different,

02:17.000 --> 02:20.000
you signal a suspicious input, because that could be a backdoor.
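
The comparison step described above can be sketched roughly like this. This is a minimal illustrative Python sketch, not the actual tool's API: the function names and the idea of representing a run by the set of syscall names it triggers are my assumptions.

```python
# Illustrative sketch of behavior comparison against representative inputs.
# A "run" is modeled here as a callable returning the syscall names the
# program made; a real implementation would trace them (e.g. via ptrace).

def trace_syscalls(program_run):
    """Stand-in for a real syscall tracer."""
    return set(program_run())

def build_baseline(representative_runs):
    """Union of syscalls observed across common, benign inputs."""
    baseline = set()
    for run in representative_runs:
        baseline |= trace_syscalls(run)
    return baseline

def is_suspicious(new_run, baseline):
    """An input triggering syscalls never seen on representative inputs
    is flagged as a potential backdoor trigger."""
    return bool(trace_syscalls(new_run) - baseline)
```

In this toy model, an input that suddenly causes, say, an `execve` that no representative input ever caused would be flagged.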

02:20.000 --> 02:24.000
However, all of these are after the fact detection, right?

02:24.000 --> 02:28.000
So at that point the backdoor is already in the release and the binary is being distributed.

02:28.000 --> 02:29.000
So it's kind of too late.

02:29.000 --> 02:32.000
Like people might have been impacted already.

02:32.000 --> 02:35.000
How do we prevent bugs from ever making it into releases?

02:35.000 --> 02:38.000
Well, one way actually is to use fuzzing in CI.

02:38.000 --> 02:40.000
This is already in use.

02:40.000 --> 02:45.000
Just for those of you who don't know, fuzzing is a brute-force automated test generation approach,

02:45.000 --> 02:48.000
where you basically spam your target program with a bunch of inputs.

02:48.000 --> 02:50.000
And you're hoping to make it crash, for example.

02:50.000 --> 02:53.000
And that might be a symptom of a vulnerability.

02:53.000 --> 02:57.000
So this already exists for many big open-source projects that you know.

02:57.000 --> 03:01.000
And the reason why is that it's highly automatic.

03:01.000 --> 03:02.000
So it's pretty easy to set up.

03:02.000 --> 03:05.000
And you're just going to let it do its thing.

03:05.000 --> 03:07.000
There's an extremely low false positive rate.

03:07.000 --> 03:09.000
So usually it's basically zero.

03:09.000 --> 03:13.000
If you have a crash then that in and of itself is a problem in most cases.

03:13.000 --> 03:16.000
And also it can work with constrained resources.

03:16.000 --> 03:19.000
So even in a CI job, a very short CI job,

03:19.000 --> 03:21.000
You can still get fuzzing to work.

03:21.000 --> 03:25.000
But this can only discover basically crash type vulnerabilities.

03:25.000 --> 03:31.000
So why don't we just apply the approach we talked about yesterday in the security

03:31.000 --> 03:33.000
devroom in the CI?

03:33.000 --> 03:35.000
Surely you can just plug it in? Well, actually, no.

03:35.000 --> 03:37.000
So there are a bunch of problems.

03:37.000 --> 03:39.000
So first of all, it's way too slow.

03:39.000 --> 03:42.000
So the approach, if you caught the talk yesterday,

03:43.000 --> 03:46.000
is targeting binary-only programs.

03:46.000 --> 03:49.000
So there's a huge like emulation overhead related to that.

03:49.000 --> 03:51.000
And so you can't just run that in CI.

03:51.000 --> 03:53.000
It can't run in 10 minutes.

03:53.000 --> 03:57.000
And the bigger problem actually is the false positives.

03:57.000 --> 04:01.000
So it's not that the tool produces many false positives.

04:01.000 --> 04:04.000
It's that it produces false positives basically every time.

04:04.000 --> 04:07.000
And this is obviously intractable for a CI pipeline.

04:07.000 --> 04:09.000
You can't have your CI telling you there's a backdoor

04:09.000 --> 04:11.000
every other commit when there's none.

04:12.000 --> 04:14.000
So we propose a new approach here.

04:14.000 --> 04:17.000
So we observe that there's a rolling effect in CI, right?

04:17.000 --> 04:21.000
So you have your version n, and n plus one, n plus two, and so on.

04:21.000 --> 04:26.000
So instead of just treating them independently, as the state-of-the-art tools do,

04:26.000 --> 04:32.000
why don't we just use the history of the project, basically, to get some more information?

04:32.000 --> 04:35.000
So for example, if you assume that the previous version contains no backdoors,

04:35.000 --> 04:37.000
then you can do a bunch of stuff.

04:37.000 --> 04:41.000
First of all, you can use its fuzzer-generated inputs as the representative inputs.

04:41.000 --> 04:45.000
And that gives you a bunch more behaviors to compare to.

04:45.000 --> 04:51.000
And then you can go even further and you can say that if I observe something being weird,

04:51.000 --> 04:56.000
quote unquote based on these characteristics in both versions, then it's probably fine.

04:56.000 --> 05:01.000
Because since the previous version is backdoor-free, like, the behavior is the same.

05:01.000 --> 05:04.000
But if I only observe it in the new version, then there's a problem.

05:04.000 --> 05:07.000
So let's go over an example so you get what I mean.

05:07.000 --> 05:14.000
So PHP ships with a built-in HTTP server that actually was attacked via backdoor in 2021.

05:14.000 --> 05:19.000
And we're going to fuzz it with this method to try to detect the backdoor.

05:19.000 --> 05:26.000
So at some point the fuzzer will generate a request to the server with a huge value for Content-Length.

05:26.000 --> 05:27.000
So that's a header.

05:27.000 --> 05:32.000
And basically, as you can imagine, the server will just refuse to treat the request.

05:32.000 --> 05:36.000
Like, you're shooting me a 400-terabyte-size request?

05:36.000 --> 05:38.000
I can't treat that so no.

05:38.000 --> 05:41.000
So that leads to different system calls on the server side.

05:41.000 --> 05:46.000
And this would have been flagged as suspicious by the existing tools, right?

05:46.000 --> 05:50.000
But if we run it in the previous version of the same HTTP server,

05:50.000 --> 05:52.000
it will react in the same way, right?

05:52.000 --> 05:54.000
This is legitimate behavior actually.

05:54.000 --> 05:59.000
So since both react in the same way, we can safely prune the input away.

05:59.000 --> 06:01.000
And we can remove what would have been a false positive.
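
The pruning rule from this example can be sketched as follows. This is an illustrative Python sketch under my own assumptions: `run_old` and `run_new` stand in for tracing an input on the previous (trusted) and current version, and behaviors are modeled as syscall sets.

```python
# Illustrative two-version differential filter: a deviation from the
# baseline that appears identically in BOTH versions is legitimate
# behavior (e.g. rejecting a huge Content-Length); a deviation that
# appears only in the new version is reported as suspicious.

def differential_filter(inputs, run_old, run_new, baseline):
    """run_old/run_new map an input to the set of syscalls it triggers;
    baseline is the syscall set observed on representative inputs."""
    suspicious = []
    for inp in inputs:
        new_extra = run_new(inp) - baseline
        old_extra = run_old(inp) - baseline
        # Same deviation in the backdoor-free previous version: prune.
        if new_extra and new_extra != old_extra:
            suspicious.append(inp)
    return suspicious
```

So an oversized-request input that makes both versions bail out the same way is pruned, while an input whose deviation exists only in the new version survives as a report.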

06:01.000 --> 06:06.000
So this method allows us to drop the false positive rate by a lot.

06:06.000 --> 06:09.000
And of course, as I said before, speed is still an issue.

06:09.000 --> 06:14.000
Since before we were targeting binaries from firmware, from multiple platforms,

06:14.000 --> 06:18.000
we were largely using QEMU to be able to handle that.

06:18.000 --> 06:19.000
So we were emulating.

06:19.000 --> 06:22.000
This is nice, but it's also extremely slow.

06:22.000 --> 06:26.000
So instead of doing that now, since we're on the developer side,

06:26.000 --> 06:29.000
and we have access to the source code, we'll just use AFL++'s

06:29.000 --> 06:33.000
compiler passes to inject the fuzzer instrumentation into the source code directly.

06:33.000 --> 06:38.000
So that gives us a much higher throughput for the fuzzer, which speeds things up.

06:38.000 --> 06:41.000
So we need to validate this to make sure it works.

06:41.000 --> 06:46.000
So first of all, we need to evaluate whether backdoor detection has been impacted.

06:46.000 --> 06:51.000
And the way we do that is we test on three real-world attacks, which have CVEs attached to them,

06:51.000 --> 06:58.000
and 10 synthetic attacks, where basically we created backdoors and injected them into popular open-source projects.

06:58.000 --> 07:02.000
And then we also want to test false positive filtering.

07:02.000 --> 07:06.000
So we can't possibly test on every commit in the history of the repo.

07:06.000 --> 07:07.000
It's kind of too much.

07:07.000 --> 07:11.000
So what we do is we borrow this approach from another paper actually.

07:11.000 --> 07:15.000
And what we do is we cluster commits in sequences.

07:15.000 --> 07:17.000
So you have sequences of three commits.

07:17.000 --> 07:22.000
And then you compute how many files were changed in a sequence and how many lines.

07:22.000 --> 07:24.000
You take the average of the three commits.

07:24.000 --> 07:26.000
And then you put them in buckets.

07:26.000 --> 07:28.000
So small, medium, high.

07:28.000 --> 07:30.000
And you take all possible combinations of the buckets.

07:30.000 --> 07:34.000
So that gives you a set of representative commits where you have many files changed.

07:34.000 --> 07:37.000
A few files changed, many lines, a few lines so on and so forth.

07:37.000 --> 07:43.000
We do this twice because many commits actually modify things that do not alter the behavior of the program,

07:43.000 --> 07:45.000
such as documentation, stuff like that.

07:45.000 --> 07:49.000
So one time we just do it indiscriminately throughout the entire history.

07:49.000 --> 07:54.000
The second time we only pick commits that actually alter source code files.
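
A rough reconstruction of that sampling scheme in Python. This is a sketch under stated assumptions: the bucket thresholds, the names, and the tie-breaking (first sequence seen per bucket pair wins) are all illustrative, not the paper's exact procedure.

```python
# Illustrative commit-sequence sampling: group history into sequences
# of 3 commits, average the files/lines changed over each sequence,
# bucket the averages, and keep one representative sequence per
# (files-bucket, lines-bucket) combination.

from statistics import mean

def bucket(value, thresholds=(10, 100)):
    # Thresholds are illustrative assumptions.
    if value < thresholds[0]:
        return "small"
    if value < thresholds[1]:
        return "medium"
    return "high"

def sample_sequences(commits, seq_len=3):
    """commits: list of (files_changed, lines_changed) tuples, oldest
    first. Returns one representative sequence per bucket combination."""
    chosen = {}
    for i in range(0, len(commits) - seq_len + 1, seq_len):
        seq = commits[i:i + seq_len]
        key = (bucket(mean(c[0] for c in seq)),
               bucket(mean(c[1] for c in seq)))
        chosen.setdefault(key, seq)  # keep first sequence per bucket pair
    return chosen
```

Running this over a full history yields a small, diverse set of commit pairs (many files / few files, many lines / few lines, and so on) to evaluate on.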

07:54.000 --> 07:59.000
And that gives us four hundred and thirty-two commit pairs to evaluate on across the

07:59.000 --> 08:02.000
thirteen different programs that we use.

08:02.000 --> 08:04.000
But actually we can go even further.

08:04.000 --> 08:12.000
So what if we add in another layer of security where if something gets past the project level check,

08:12.000 --> 08:16.000
we can maybe catch it at the distro-level check, or the packaging check.

08:16.000 --> 08:21.000
So for example, when a Debian release manager wants to make a new release,

08:21.000 --> 08:26.000
they can run the same, like, ten-minute tool on very sensitive packages, say.

08:26.000 --> 08:32.000
And that way there's another net, basically, that we can cast to catch any attacks that might have slipped through.

08:32.000 --> 08:36.000
So in order to test that, we also took the last three Debian and Ubuntu releases.

08:36.000 --> 08:42.000
And that yields another fifty release pairs that we can test on, with much bigger diffs this time.

08:42.000 --> 08:49.000
And we also apply the backdoors on top of these, just to test how the backdoor detection works in the case of releases.

08:50.000 --> 08:56.000
So, grand total: 482 version pairs for the commit filtering and 63 for the backdoors.

08:56.000 --> 09:00.000
It's more than 500 different version pairs basically to test on.

09:00.000 --> 09:05.000
So the results are that all backdoors are still detected, as they were with the previous tool.

09:05.000 --> 09:07.000
So we lose nothing on that end.

09:07.000 --> 09:13.000
And actually the detection rate goes up, because of the optimizations with the source-based fuzzer.

09:13.000 --> 09:15.000
And this is for ten-minute fuzzing campaigns.

09:15.000 --> 09:18.000
So this is nothing basically in terms of time.

09:18.000 --> 09:22.000
And false-positive-wise, we have an extremely low rate as well.

09:22.000 --> 09:29.000
So basically only 17 runs out of more than 8,500 had a false positive, and each of those had a single one.

09:29.000 --> 09:40.000
And actually we also added a tool that can generate a report showing which lines cause the suspicious behavior or the suspicious system calls.

09:40.000 --> 09:45.000
So you can find yourself in a situation where you have multiple thousands of lines

09:45.000 --> 09:51.000
that get boiled down to two lines that emit suspicious system calls, and as a developer that's much easier to check.
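
That boiling-down step might look roughly like this. This is a toy Python sketch of the idea only; the mapping from source lines to the syscalls they emit is assumed to come from instrumentation or stack traces, and all names here are hypothetical.

```python
# Illustrative report step: out of a large commit diff, keep only the
# changed lines whose execution emits one of the suspicious syscalls,
# so a reviewer inspects a handful of lines instead of thousands.

def boil_down(diff_lines, syscall_sites, suspicious_syscalls):
    """diff_lines: line numbers changed by the commit;
    syscall_sites: {source_line: syscall emitted there}, from tracing;
    returns only the changed lines emitting a suspicious syscall."""
    return [line for line in diff_lines
            if syscall_sites.get(line) in suspicious_syscalls]
```

For instance, a thousand-line diff where only line 42 emits an `execve` that the baseline never saw would reduce to that single line in the report.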

09:51.000 --> 09:58.000
So yeah, in conclusion, backdoor prevention is possible with an extremely low overhead, because you're reusing fuzzing results.

09:58.000 --> 10:05.000
A 90% detection rate on our benchmark, with only a 0.2% false positive rate across the entire thing.

10:05.000 --> 10:06.000
Thank you very much.

10:06.000 --> 10:16.000
We have time for a few questions.

10:16.000 --> 10:18.000
Let's go.

10:18.000 --> 10:20.000
So thanks for the talk.

10:20.000 --> 10:26.000
[Partly inaudible question: is the plan to get this included in some repositories, or adopted by distributions?]

10:33.000 --> 10:41.000
So yeah, the idea is to pitch this to both developers of big open-source projects and also distro maintainers, release managers basically.

10:41.000 --> 10:45.000
So you can have multiple layers that check at different intervals.

10:45.000 --> 10:51.000
So a project might check commit by commit, but the release manager might check once every two years or something.

10:51.000 --> 10:55.000
So it's harder to get past both filters at the same time.

10:55.000 --> 10:59.000
And so where can we kind of check it out?

10:59.000 --> 11:07.000
So it's not out yet, but well, if you look at my site or something, I will release it as soon as it's out.

11:07.000 --> 11:11.000
Okay, one more question.

11:11.000 --> 11:16.000
Is it language-specific, or...?

11:16.000 --> 11:18.000
So well, there's a caveat.

11:18.000 --> 11:20.000
So fuzzing, you can do it in any language.

11:20.000 --> 11:22.000
There are many fuzzers that target many languages.

11:22.000 --> 11:27.000
The fuzzer that we're using for this prototype targets C/C++ programs.

11:27.000 --> 11:31.000
But it's fairly easy to find another fuzzer for your language of choice.

11:31.000 --> 11:35.000
Plug it into our system; our system can take any fuzzer, basically.

11:35.000 --> 11:38.000
So it works based on the same premise.

11:38.000 --> 11:45.000
Basically, as soon as you have a system that uses syscalls, which means basically everything, you can detect backdoors this way.

11:45.000 --> 11:50.000
Okay, a few more questions.

11:51.000 --> 11:53.000
You first, then the next.

11:53.000 --> 11:57.000
You were saying that you are keeping the fuzzing corpus between one run and the other.

11:57.000 --> 11:59.000
So how much space does that take?

11:59.000 --> 12:03.000
Because you are kind of trading speed for space at this point.

12:03.000 --> 12:04.000
Sure.

12:04.000 --> 12:06.000
So the thing is, you don't have to save all the history.

12:06.000 --> 12:09.000
You just have to save the last one.

12:09.000 --> 12:14.000
So sure that might be, I don't know, a few thousand to tens of thousands of test cases.

12:14.000 --> 12:17.000
But those are like small text files basically.

12:17.000 --> 12:19.000
But you only need to save the previous one.

12:19.000 --> 12:22.000
The system does not require a longer history than that.

12:22.000 --> 12:24.000
If you wanted to be really careful, you could save more.

12:24.000 --> 12:27.000
But you can just save the previous one, it will work.

12:27.000 --> 12:28.000
Oh, thank you.

12:28.000 --> 12:29.000
Thank you.

12:29.000 --> 12:30.000
Question at the back.

12:30.000 --> 12:31.000
Go ahead.

12:31.000 --> 12:34.000
I was wondering if tools like this will help

12:34.000 --> 12:37.000
backdoor designers, maybe, to make backdoors that don't get caught?

12:37.000 --> 12:41.000
So the benchmark that we use, we can talk about it later if you want.

12:41.000 --> 12:42.000
It's pretty adversarial.

12:42.000 --> 12:48.000
So we actually, in the paper, we're writing about different attack models against this method.

12:48.000 --> 12:51.000
And from our evaluation, they don't really seem to work reliably.

12:51.000 --> 12:54.000
So you have things that can work a small percentage of the time.

12:54.000 --> 13:00.000
But like, you cannot guarantee that this will get past multiple filters down the line.

13:00.000 --> 13:03.000
Maybe one last question.

13:07.000 --> 13:10.000
Going twice, last question.

13:10.000 --> 13:11.000
No.

13:12.000 --> 13:14.000
[Partly inaudible question: couldn't an attacker hide the trigger so a fuzzer can't reach it, for example by requiring data cryptographically derived from the input?]

13:24.000 --> 13:25.000
Sure.

13:25.000 --> 13:26.000
Modern fuzzers?

13:26.000 --> 13:27.000
Yeah.

13:27.000 --> 13:28.000
Sorry.

13:28.000 --> 13:29.000
Sure.

13:29.000 --> 13:32.000
There are, like, fuzzing-blocking constructs.

13:32.000 --> 13:36.000
For example, if you put a crypto check, like in xz-utils, a fuzzer can't get past that.

13:36.000 --> 13:40.000
But neither can a symbolic execution engine or any dynamic analysis.

13:40.000 --> 13:42.000
That's, like, the point of crypto.

13:42.000 --> 13:44.000
So yeah, you're limited by that.

13:44.000 --> 13:49.000
However, if you do that, you then have a huge blind spot in your coverage.

13:49.000 --> 13:52.000
So, as a contributor to the project,

13:52.000 --> 13:55.000
you have to explain why what you added cannot be covered.

13:55.000 --> 13:58.000
So for the attacker, it's complicated.

13:58.000 --> 13:59.000
Thank you.

13:59.000 --> 14:00.000
Thank you.

14:00.000 --> 14:02.000
Thank you.

