WEBVTT

00:00.000 --> 00:14.400
So, this is a short version of a talk that I usually give about my journey from old-school

00:14.400 --> 00:15.400
cloud developer.

00:15.400 --> 00:20.080
I contributed a lot of code to OpenStack, went to Kubernetes, yeah, yeah, okay.

00:20.080 --> 00:25.920
Recently I got bitten by the molecular biology bug, and that got me started in bioinformatics

00:25.920 --> 00:30.200
and doing a Master of Research in London recently, right?

00:30.200 --> 00:35.320
Of course, I don't pretend or expect to become a domain expert in biology, but I like

00:35.320 --> 00:41.600
to bring the old expertise I have in building clouds and stuff like that into this domain, right?

00:41.600 --> 00:48.120
So the project I did as part of this is what I'm going to highlight very quickly here.

00:48.120 --> 00:53.240
To begin with, we have two problems, which are, let's say, slightly relevant for this room,

00:53.240 --> 00:56.160
but it just gives an idea about what we're trying to achieve.

00:56.160 --> 01:01.680
So one of them is antigen specificity, meaning given an antibody sequence, will it bind

01:01.680 --> 01:03.800
to a given antigen?

01:03.800 --> 01:08.200
Just for simplicity, for example, the SARS-CoV-2 spike protein, that all of us are familiar

01:08.200 --> 01:13.920
with from the pandemic, and it's a binary classification problem in machine learning

01:13.920 --> 01:19.120
terms, so meaning: is it going to bind, yes or no? Okay, in biological terms it's not binary,

01:19.120 --> 01:20.120
but we simplify here.
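In machine-learning terms, this framing can be sketched as a tiny labeled dataset plus a yes/no classifier. Everything below is invented for illustration: the sequences, the labels, and the dummy scoring rule all stand in for the real data and the real fine-tuned model.

```python
# Toy binary-classification framing: each example pairs an antibody
# sequence with a 0/1 label ("binds the antigen": yes/no).
# Sequences, labels, and the scoring rule are made up for illustration.
dataset = [
    ("QVQLVQSGAEVKK", 1),  # hypothetical binder
    ("EVQLLESGGGLVQ", 0),  # hypothetical non-binder
    ("QVQLKESGPGLVA", 1),
    ("DIQMTQSPSSLSA", 0),
]

def predict_binds(sequence: str, threshold: float = 0.2) -> int:
    """Dummy stand-in for a fine-tuned model: score by glutamine (Q)
    frequency and threshold it. A real model replaces this function."""
    score = sequence.count("Q") / len(sequence)
    return 1 if score >= threshold else 0

# Evaluation then reduces to comparing predictions against labels.
accuracy = sum(predict_binds(seq) == label for seq, label in dataset) / len(dataset)
```

The point is only the shape of the problem: one sequence in, one binary label out, so any standard classification metric applies directly.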

01:20.520 --> 01:25.360
The other one is paratope prediction, so given the whole antibody, which positions

01:25.360 --> 01:31.600
in that sequence of amino acids, right, are going to bind with the antigen, meaning

01:31.600 --> 01:37.240
the foreign body that we are trying to prevent from causing trouble, right?

01:37.240 --> 01:42.160
And this one is a token classification problem, meaning that given a sequence of tokens,

01:42.240 --> 01:48.560
in this case, one for each amino acid, which one of them has the positive or negative

01:48.560 --> 01:52.960
label, okay?
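The token-classification framing can be sketched the same way: one label per residue. The sequence and the binding positions below are invented for illustration, not real paratope data.

```python
# Toy token-classification framing: one label per amino acid,
# 1 where the position is part of the binding site, else 0.
# Sequence and binding positions are invented for illustration.
def positions_to_labels(sequence: str, binding_positions: set[int]) -> list[int]:
    """Turn a set of 0-based binding positions into per-token labels,
    one label per residue in the sequence."""
    return [1 if i in binding_positions else 0 for i in range(len(sequence))]

labels = positions_to_labels("QVQLVQSG", {2, 3, 5})
# One label per residue: [0, 0, 1, 1, 0, 1, 0, 0]
```

A token-classification model predicts exactly such a label vector, so per-position metrics fall out of comparing it against the ground-truth vector.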

01:52.960 --> 01:59.360
In doing this, we are comparing a pretty large number of existing

01:59.360 --> 02:05.480
models that we're going to fine tune for this particular task, okay?

02:05.480 --> 02:10.920
Well, you know, that resulted in around 600 tasks, okay?

02:10.920 --> 02:16.040
For both sequence and token classification, most of them requiring GPUs, so it's not something

02:16.040 --> 02:21.000
that you'll just run on your laptop; it's not just that it requires a good amount of processing

02:21.000 --> 02:27.160
power, but also that it requires a good amount of orchestration, because all those tasks

02:27.160 --> 02:34.560
are chained, right, some of them parallelized, of course, and we require a lot of attention

02:34.560 --> 02:38.400
to the order in which they run, so if you do it manually, besides taking forever, there is

02:38.400 --> 02:42.400
a very, very high chance that you're introducing human error.

02:42.400 --> 02:47.000
So both in terms of repeatability and reproducibility for other researchers, it becomes

02:47.000 --> 02:48.520
a nightmare.

02:48.520 --> 02:55.560
So I went out and looked, hey, what can I use to automate all these things?

02:55.560 --> 03:00.160
And I looked, of course, for directed acyclic graph solutions, which are very common, right;

03:00.160 --> 03:04.840
two very common ones out there are, for example, Apache Airflow, which I used here,

03:04.840 --> 03:09.760
and another one, very common in the scientific world, is Nextflow.
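The core idea behind both orchestrators is the same: tasks form a directed acyclic graph, and each task runs only after all its upstream dependencies have finished. A minimal sketch of that scheduling logic in plain Python (using the standard library, not Airflow's or Nextflow's actual code; the task names are placeholders):

```python
from graphlib import TopologicalSorter

# Hypothetical mini-pipeline: each entry maps a task to the set of
# tasks it depends on (its upstream tasks in the DAG).
dag = {
    "cluster": {"download"},
    "split": {"cluster"},
    "fine_tune": {"split"},
    "report": {"fine_tune"},
    "email": {"report"},
}

# A topological order is any execution order that respects dependencies;
# an orchestrator additionally runs independent tasks in parallel.
order = list(TopologicalSorter(dag).static_order())
```

An orchestrator like Airflow adds retries, parallelism limits, and state tracking on top of exactly this dependency resolution.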

03:09.760 --> 03:13.400
Who has heard about Apache Airflow up to now?

03:13.400 --> 03:15.160
How many of you have heard about Nextflow?

03:15.160 --> 03:18.920
Okay, so both of them are very popular.

03:18.920 --> 03:21.800
And then, of course, that thing is just the orchestration layer.

03:21.800 --> 03:26.480
You need to choose an underlying platform to do that, right?

03:26.480 --> 03:31.960
Slurm, actually, is something which is very common, right, in the scientific world,

03:31.960 --> 03:34.080
in the HPC world.

03:34.080 --> 03:38.800
But coming from OpenStack, Kubernetes, and so on, for me, of course, Kubernetes is the significantly

03:38.800 --> 03:43.480
more natural way to do this, and I wanted to see if it can work very well also for this

03:43.480 --> 03:47.440
type of HPC use cases.

03:47.440 --> 03:53.160
So here is a quick overview of what the antigen affinity pipeline looks like.

03:53.160 --> 03:54.960
There are two of them, right?

03:55.040 --> 03:59.400
Again, this is a very short version of the talk, so I'm just focusing on one of the two.

03:59.400 --> 04:05.040
But you can see we have to begin with a lot of tasks that are needed to prepare the data set.

04:05.040 --> 04:11.440
So we start with a raw file, a raw archive file, basically with all the sequences.

04:11.440 --> 04:17.840
We start by clustering them, very important because we don't want to have duplicates,

04:17.840 --> 04:23.280
which in terms of sequences gets also quite complicated, so I'm not going to do the details.

04:23.280 --> 04:27.480
We need to split training and validation sets, then we have a completely separate

04:27.480 --> 04:32.640
test data set, which has to be independent from the other for data leakage considerations,

04:32.640 --> 04:37.120
which has to be clustered as well, and potentially undersampled if needed.

04:37.120 --> 04:41.760
And then we want to have a control in which we simply shuffle all the labels, so we want

04:41.760 --> 04:45.000
to see what predictions we also get on that.
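The shuffled-label control can be sketched like this: the labels are permuted with a fixed seed, which keeps the class balance but destroys any real sequence-to-label relationship, so a model trained on it should score near chance. Purely illustrative, not the pipeline's actual code:

```python
import random

def shuffled_label_control(labels: list[int], seed: int = 42) -> list[int]:
    """Return a copy of the labels permuted with a fixed seed, so the
    control dataset keeps the same class balance but random assignment."""
    shuffled = labels.copy()
    random.Random(seed).shuffle(shuffled)
    return shuffled

control = shuffled_label_control([1, 1, 0, 0, 1, 0])
```

The fixed seed matters for reproducibility: every run of the pipeline produces the same control dataset.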

04:45.000 --> 04:49.680
And then for each of the models and we have a lot of them, for those of you familiar with

04:49.680 --> 04:55.120
the domain: AntiBERTa, AntiBERTy, ESM-2, and so on; ESM-2 may be familiar to those of you

04:55.120 --> 04:57.360
who come from the Meta group.

04:57.360 --> 05:03.760
Some of them are very large models; the biggest one I used here has 15 billion parameters.

05:03.760 --> 05:08.360
And for all of them, you repeat the stuff which is in the blue box, the yellow, smaller

05:08.360 --> 05:10.920
boxes are the tasks which require GPUs.

05:10.920 --> 05:16.520
So the core of that involves fine-tuning models, getting the fine-tuned weights and

05:16.520 --> 05:20.040
stuff like that, okay.

05:20.040 --> 05:24.920
At the end of it, we have completely separate tasks which generate reports on the fine-tuning, and

05:24.920 --> 05:27.680
if everything goes well, you get an email with the results.

05:27.680 --> 05:31.560
So the idea is that you start this thing in the evening, takes around six hours on a server

05:31.560 --> 05:36.120
that I used with two A100s on it, two NVIDIA A100s.

05:36.120 --> 05:39.200
And by the morning, you get an email, right?

05:39.200 --> 05:43.960
It's also made in a smart way, so that if a task doesn't need to be run, because I already

05:43.960 --> 05:47.160
have processed, for example, that particular part of the pipeline, it doesn't repeat

05:47.160 --> 05:50.000
the whole thing, right?
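That skip-if-already-done behaviour can be sketched as a simple cache check: a task writes its output to shared storage, and on re-runs it is skipped when the output already exists. Airflow actually tracks task state rather than files; this file-based version is only an illustration, with made-up file names.

```python
from pathlib import Path
import tempfile

def run_task(output: Path, compute) -> bool:
    """Run `compute` only if `output` is missing; return True if it ran."""
    if output.exists():
        return False          # already processed: skip the expensive work
    output.write_text(compute())
    return True

workdir = Path(tempfile.mkdtemp())
first = run_task(workdir / "clusters.txt", lambda: "clustered sequences")
second = run_task(workdir / "clusters.txt", lambda: "clustered sequences")
# first is True (task ran), second is False (output cached, task skipped)
```

On shared storage this is what makes partial re-runs cheap: only the tasks downstream of what actually changed need to execute again.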

05:50.000 --> 05:53.200
And the pipeline is made to be run automatically whenever anything has changed in the data

05:53.200 --> 05:57.520
sets, changes in the code, and whatnot.

05:57.520 --> 06:06.000
Now I'm going to try a live demo, which is something relatively rare in lightning talks,

06:06.000 --> 06:10.520
so this is the Apache Airflow interface, okay.

06:10.560 --> 06:18.120
I take, for example, one of the prediction tasks, and I clear it, okay.

06:18.120 --> 06:25.720
This will, if the network works, tell Airflow to start rescheduling that

06:25.720 --> 06:29.320
particular task, so not all of them, just that one, because we don't have the

06:29.320 --> 06:30.320
time.

06:30.320 --> 06:35.520
That will spin up; of course, it will contact Kubernetes through the operator behind the

06:35.520 --> 06:41.160
executor, and Kubernetes will start scheduling a container, right?

06:41.160 --> 06:45.320
The big difference here is that, compared to Slurm, for example, Kubernetes doesn't really

06:45.320 --> 06:50.160
have a scheduling solution, which works very well for this type of task, right?

06:50.160 --> 06:54.080
That part of the work is, of course, entirely offloaded in this case to Airflow, which

06:54.080 --> 06:58.720
does a pretty good job with that, keeps on retrying, basically, and scheduling a maximum number

06:58.720 --> 07:02.360
of tasks based on what your configurations are.

07:03.280 --> 07:08.040
It's already finished, and I can look here, so this is a big advantage, for example, compared

07:08.040 --> 07:12.720
to things like Nextflow, because all the output from my containers comes out here,

07:12.720 --> 07:13.720
right?

07:13.720 --> 07:22.080
When that is finished, it goes here, and it sees that it already collected

07:22.080 --> 07:28.160
everything else, so it's going to go through to the next task, which

07:28.160 --> 07:33.040
involves, for example, running all the reports and everything; here, this is the R code,

07:33.040 --> 07:38.960
generating an RMD, and at the end of it, it's going to send an email, okay?

07:38.960 --> 07:46.680
So if all went well, I'm going to have an email here, which just arrived with all my results.

07:46.680 --> 07:52.480
I also simplified it, running only two models here instead of all of them.

07:52.520 --> 07:58.880
I click on it, and I get all these nice metrics, you know, which show me comparisons

07:58.880 --> 08:00.960
between all the models, okay?

08:00.960 --> 08:11.680
In this case, with only two models, for simplicity, but this is what the whole thing

08:11.680 --> 08:16.800
looks like comparing all of them: you see all the models on the X-axis,

08:16.840 --> 08:23.040
and the metrics, so for example, recall, FPR, F1, average precision, and so on, so it's very

08:23.040 --> 08:27.760
useful for comparing this stuff.
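The per-model metrics in those plots come straight from the confusion-matrix counts. As a reminder of the standard definitions (not the report's actual code; average precision needs ranked scores and is omitted here):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # true positive rate
    fpr = fp / (fp + tn)               # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}

metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
# recall 0.8, FPR 0.1, precision 0.8, F1 0.8
```

Reporting FPR alongside recall matters here because binding datasets are often imbalanced, so accuracy alone would be misleading.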

08:27.760 --> 08:38.220
Moving on, in terms of architecture, you have three main components: you have an Airflow

08:38.220 --> 08:42.000
REST API, your DAGs, which are running; you have a scheduler and a queue, which run

08:42.000 --> 08:44.360
the work on the workers, which are the third component.

08:44.360 --> 08:48.760
If you compare it with Nextflow, Nextflow would basically be the scheduler-plus-worker

08:48.760 --> 08:51.400
part, right?

08:51.400 --> 08:52.400
We saw this thing.

08:52.400 --> 08:55.320
An important consideration is relative to storage.

08:55.320 --> 09:02.840
Here, every single container which gets spawned, running each individual task, needs

09:02.840 --> 09:05.120
to have access to a single shared storage.

09:05.120 --> 09:06.120
How do you do that?

09:06.120 --> 09:11.160
Well, in Kubernetes, you have so-called CSIs, you know, drivers specific for storage;

09:11.160 --> 09:16.760
we need to pick one that has ReadWriteMany, meaning that you can share the same PVC,

09:16.760 --> 09:17.760
right?

09:17.760 --> 09:22.120
The persistent volume claim, to all of those containers, which are running in parallel,

09:22.120 --> 09:27.160
so that they can share both the code, the storage, and so on.
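As an illustration, here is what such a ReadWriteMany claim looks like as the manifest you would feed to the Kubernetes API, written as a Python dict. The claim name, storage class, and size are placeholders; whether RWX is actually supported depends on the CSI driver behind the storage class.

```python
# Hypothetical PersistentVolumeClaim manifest as a Python dict;
# name, storageClassName, and size are placeholders for your environment.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "pipeline-shared"},
    "spec": {
        "accessModes": ["ReadWriteMany"],  # mountable by many pods at once
        "storageClassName": "cephfs",      # placeholder: driver must support RWX
        "resources": {"requests": {"storage": "100Gi"}},
    },
}
```

Every task container then mounts the same claim, so parallel tasks see the same code and data.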

09:27.160 --> 09:35.480
Two great examples in that case are Ceph for production, or NFS for PoCs, right?

09:35.560 --> 09:42.800
So, both of them work pretty well, and they solve the problem; of course, Ceph is more

09:42.800 --> 09:46.640
complex, but it's better for serious production use cases.

09:46.640 --> 09:48.720
Optimizations: you have a couple of them.

09:48.720 --> 09:52.720
Again, here I'm not going into the details, but I'm using time-slicing for this particular

09:52.720 --> 09:53.720
case.

09:53.720 --> 09:56.640
I'm sharing my two GPUs between all those models.

09:56.640 --> 10:00.600
But you can do MIG if you want more security, more multi-tenancy, or you can use

10:00.600 --> 10:08.240
MPS, but it's currently at an experimental stage in the Kubernetes device plugin, right?

10:08.240 --> 10:12.880
And then for scaling, since you want to share the model across multiple GPUs, you want

10:12.880 --> 10:17.400
to have, of course, Hugging Face Accelerate, which I'm using for most models, and DeepSpeed,

10:17.400 --> 10:23.000
which basically allow you to split your models across multiple GPUs and across multiple

10:23.000 --> 10:24.000
nodes.

10:24.000 --> 10:30.920
There are three ZeRO stages there: one, two, three. Stage one allows you to do less sharding, but

10:30.920 --> 10:35.480
it's less problematic to configure, and the third one allows you to do more distribution

10:35.480 --> 10:39.280
of the data, but it's more complicated to configure.

10:39.280 --> 10:43.560
What you saw here uses stage three.
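As an illustration, the stage choice is essentially a one-line change in the DeepSpeed config; the batch-size and precision values below are placeholders, not the talk's actual configuration.

```python
# Minimal DeepSpeed-style config dict; "zero_optimization.stage" selects
# how aggressively state is sharded across GPUs: stage 1 shards optimizer
# state only, stage 2 adds gradients, stage 3 also shards the parameters.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "zero_optimization": {
        "stage": 3,                        # most sharding, hardest to tune
    },
    "bf16": {"enabled": True},             # placeholder precision setting
}
```

Stage 3 is what makes a 15-billion-parameter model fit, since even its weights no longer need to reside on any single GPU.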

10:43.560 --> 10:48.840
Last but not least, something that I will explore more in a future talk, because

10:48.840 --> 10:52.320
I'm currently porting this pipeline also to Nextflow.

10:52.320 --> 10:55.080
Often people ask me which is better, Airflow or Nextflow.

10:55.080 --> 10:58.760
In reality, there is no better option.

10:58.760 --> 11:05.880
It's just that, very shortly, one is Python, the other is Groovy; one uses a, let's say, very

11:05.880 --> 11:10.000
opinionated DSL, the other one is just Python code.

11:10.000 --> 11:15.040
One has an easy learning curve; Nextflow's is a bit steeper.

11:15.040 --> 11:20.640
Nextflow is more HPC-oriented, if you want, but most importantly, Nextflow has an extremely

11:20.720 --> 11:25.560
strong community, which is nf-core, on the biology side, which is not present,

11:25.560 --> 11:30.720
of course, on Airflow, and in my opinion, for example, coming also from the OpenStack

11:30.720 --> 11:35.200
experience, community is way more important than code; if there is a community, you

11:35.200 --> 11:36.720
can fix the code, right?

11:36.720 --> 11:41.400
So that's why I think it's very interesting to port this pipeline now to Nextflow,

11:41.400 --> 11:43.600
and see what's coming next.

11:43.600 --> 11:44.600
Thank you.

11:44.600 --> 11:46.600
I'm done with this.

11:47.520 --> 11:52.520
We have time for one question.

11:52.520 --> 12:08.520
Yeah, while the next speaker can line up, one question, yes, that's the next step.

12:08.520 --> 12:14.920
Okay, that was a quick question, I should repeat it: am I going to port this afterwards

12:14.960 --> 12:18.120
to Nextflow, so nf-core; yes, that is the plan, okay?

12:18.120 --> 12:19.400
All right, thank you so much, guys.

