WEBVTT

00:00.000 --> 00:13.000
Hello, everyone. First, I would like to thank the guys for giving me the opportunity to talk at this excellent conference.

00:13.000 --> 00:23.000
My name is Jose Espinosa-Carrasco. I'm working in Cedric Notredame's lab at the CRG in Barcelona. This is the lab where Nextflow was originally developed.

00:23.000 --> 00:36.000
I'm also an nf-core core team member, and today I'm going to introduce you to nf-core/proteinfold, which is a standardized pipeline hub for state-of-the-art protein structure prediction methods.

00:36.000 --> 00:49.000
So if you don't know already, you may be wondering why it's interesting to predict protein structure. The thing is that once you have the structure of a protein, you can infer its function.

00:49.000 --> 01:03.000
And once we know the function of this protein, we can do some applications, like, for instance, stabilizing the structure for biotechnological applications, so that proteins work at higher temperatures with higher performance.

01:03.000 --> 01:24.000
We can do things like design drugs acting on its active site in an intelligent way, or we can do more basic science and study the evolution of a given protein, comparing its structure with other protein families and species.

01:24.000 --> 01:39.000
And it's possible to get the structure of a protein using experimental methods, but the thing is that, as you may imagine, this is very time-consuming and also very expensive.

01:39.000 --> 01:49.000
And that's why it has been a long-standing problem in computational structural biology: how to predict structures using computational methods.

01:49.000 --> 02:05.000
So much so that there's even a competition, which is called the CASP contest, and during this contest the protein structure prediction groups gather and compare their methods, one against the other, to see which one wins on different datasets.

02:05.000 --> 02:24.000
And it was in the framework of this contest that, for the first time, AlphaFold showed that AI-based methods can outperform existing methods, and not only that, but AlphaFold was able to achieve almost experimental accuracy.

02:24.000 --> 02:52.000
So this was a big breakthrough in biology, and it was awarded the 2024 Nobel Prize in Chemistry, as you can see here. But in more practical terms, what I'm showing you here in this circle is how many structures were available in the PDB database, which is the database where people deposit the structures of proteins that were determined experimentally.

02:52.000 --> 03:01.000
And after AlphaFold's inception, you can see how the number of available structures has exploded.

03:01.000 --> 03:11.000
And indeed, the number of tools that can predict structures using AI-based methods has also exploded.

03:11.000 --> 03:20.000
Here you have several of them, and we have mainly two categories. Most of them, as you can see, are evolution-driven models.

03:20.000 --> 03:29.000
And this means that what these tools do is search for similar sequences in databases, and then they perform a multiple sequence alignment.

03:29.000 --> 03:38.000
And with this multiple sequence alignment, what they want to find out is which parts of the protein sequence are evolutionarily conserved.

03:38.000 --> 03:49.000
These parts of the protein are important for the function, and they tend to be maintained in evolution so that the protein can still carry out its function.

03:49.000 --> 04:01.000
After this, this information, together with the sequence, is fed to a neural network, which is the one that infers the structure after several steps of refinement.

04:01.000 --> 04:18.000
We also have other tools that use large language models, like ESMFold, and in this case these tools are faster, but sometimes at the expense of accuracy, mainly when dealing with long sequences.

04:18.000 --> 04:25.000
So, as always in computation, there is a trade-off between speed, accuracy, and cost.

04:25.000 --> 04:35.000
Yes, I wanted to say something else. I got lost with it, I mean, sorry.

04:35.000 --> 04:48.000
And of course, users, depending on what they want to do, they will use one of the tools, or they will like to use several of these tools, and so on and so forth.

04:48.000 --> 04:59.000
But actually running these tools is not straightforward. The developers try to provide you with images to get it done, but even so it's hard.

04:59.000 --> 05:23.000
And why is it hard? Because all of these tools rely on several libraries, and on different versions of those libraries, so you have to make all of them available in your environment. Also, the tools that need the MSA step rely on very big databases that you have to have in your environment.

05:23.000 --> 05:41.000
And this is why we implemented nf-core/proteinfold. Here you have the typical nf-core metro map, and you can see here, in this first workflow, that this pipeline allows you to download all the databases that you need for these tools.

05:41.000 --> 05:51.000
And this is a one-shot workflow. This means that you don't need to run it each time you run the pipeline: you run it once, and then you can provide the databases to the pipeline through its parameters.

05:51.000 --> 06:00.000
And then you can see the alternative paths that you can follow in this pipeline, this is the development version, by the way.

06:00.000 --> 06:12.000
So here on the top, you have the evolution-driven tools.

06:12.000 --> 06:23.000
In this case, these tools perform both the MSA step and the prediction step in the same process, so you cannot split them.

06:23.000 --> 06:40.000
Then here you have these two other paths, the one in green and the one in yellow. The green one is AlphaFold2. AlphaFold2 doesn't natively allow this separation between the MSA step and the prediction step, but we have modified it to be able to do it.

06:40.000 --> 06:48.000
And also ColabFold. For the ColabFold MSA, in this case the separation is native, and we just take advantage of it.

06:48.000 --> 06:54.000
And here you have ESMFold, which is the one that uses the large language model approach.

06:54.000 --> 07:08.000
So the pipeline gives you the PDBs, the structures of the proteins. But this part is also very interesting, and it is something that we have been working on: for each tool you get a report with all the information on the accuracy.

07:08.000 --> 07:18.000
But you also get another report, which allows you to compare the performance of the different tools that you have used in your run.

07:18.000 --> 07:28.000
And here I am just summarizing some of the features of this pipeline. As I showed before, it allows you to download the databases and the parameters of the models.

07:28.000 --> 07:36.000
Everything is containerized, which means that you don't need to install any of these tools or their dependencies.

07:36.000 --> 07:50.000
It's easy to configure, and this is important for the tools that allow you to run the prediction step separately from the MSA step, because the MSA step does not gain any advantage from being run on GPUs.

07:50.000 --> 07:53.000
This way you can configure it to run on CPU and GPU.

07:53.000 --> 08:05.000
And this is something that can actually be done thanks to Nextflow, because Nextflow separates the configuration from the logic of the pipeline.

08:05.000 --> 08:10.000
So you just need to provide a configuration file and you get it.

08:10.000 --> 08:17.000
And as I mentioned before, it allows you to do benchmarking of methods.

08:17.000 --> 08:23.000
And actually this is also interesting, because this is something that we are pushing in our lab: hub pipelines.

08:23.000 --> 08:31.000
This means pipelines where you have several tools that perform the same thing.

08:31.000 --> 08:43.000
This will allow users to compare these tools among them, but also the developers of the tools, or people that are interested in comparing the performance of these tools, can do it using these pipelines.

08:43.000 --> 08:52.000
It has the two reporting capabilities that I already mentioned, and it's part of nf-core, which has several benefits, as I will show you.

08:52.000 --> 08:56.000
Here in this slide, I'm showing the comparison report.

08:56.000 --> 09:01.000
Each of the colors represents one of the tools that we have used to predict this structure.

09:01.000 --> 09:04.000
This is a sequence that we use for the testing of the pipeline.

09:04.000 --> 09:08.000
So it's a single sequence and several tools to get the structures.

09:08.000 --> 09:10.000
You can see how you can spin them.

09:10.000 --> 09:12.000
You can choose which one you want to show.

09:12.000 --> 09:17.000
Then you can see here the accuracy of each of the tools.

09:17.000 --> 09:22.000
The average accuracy.

09:22.000 --> 09:33.000
What you can see afterwards is the accuracy along the sequence.

09:33.000 --> 09:38.000
And then finally, you can see the coverage along the sequence as well.

09:38.000 --> 09:43.000
Let's see if I...

09:44.000 --> 09:48.000
So...

09:48.000 --> 09:51.000
Yes.

09:51.000 --> 09:52.000
Yes.

09:52.000 --> 09:54.000
I will stay in this one.

09:54.000 --> 09:55.000
Ah.

09:55.000 --> 09:56.000
Is it opposite?

09:56.000 --> 09:57.000
Okay.

09:57.000 --> 09:59.000
What is that?

09:59.000 --> 10:02.000
Sorry about that.

10:02.000 --> 10:08.000
Okay.

10:08.000 --> 10:14.000
So this is a real application of the pipeline that has been done by Kiran Roewell in Sydney.

10:14.000 --> 10:24.000
He wanted to test this hypothesis: whether cells evolved via symbiosis between

10:24.000 --> 10:27.000
sulfate-reducing bacteria and hydrogen-producing archaea.

10:27.000 --> 10:29.000
So he got all the...

10:29.000 --> 10:34.000
They sequenced the whole genome of one of these archaea.

10:34.000 --> 10:40.000
They then used nf-core/proteinfold to predict all the structures of this archaeon.

10:40.000 --> 10:45.000
And for this, he was using alternative methods because for some of them it was easy.

10:45.000 --> 10:47.000
He was using just ESMFold.

10:47.000 --> 10:48.000
And that's all.

10:48.000 --> 10:53.000
But when he was dealing with longer sequences, sometimes it was not working.

10:53.000 --> 10:56.000
Then he moved to AlphaFold2, or in some cases that were even longer

10:56.000 --> 11:00.000
or more difficult to infer, he used AlphaFold3.

11:00.000 --> 11:05.000
And of course it was very interesting to have these reports to compare their performance.

11:05.000 --> 11:13.000
And then he used Foldseek to see whether these structures actually fall close together with the bacterial sequences.

11:13.000 --> 11:18.000
But this is... I think I'm a little bit out of time.

11:18.000 --> 11:21.000
You can read the paper here.

11:22.000 --> 11:24.000
This is not the last version of the... Sorry.

11:24.000 --> 11:26.000
Well, I will move here.

11:26.000 --> 11:31.000
And another interesting thing about nf-core is that...

11:31.000 --> 11:37.000
nf-core makes pipelines last longer.

11:37.000 --> 11:39.000
And why is this?

11:39.000 --> 11:41.000
Because sometimes you work in your lab.

11:41.000 --> 11:42.000
You have a very nice pipeline.

11:42.000 --> 11:44.000
But at some point the funding stops.

11:44.000 --> 11:48.000
Or maybe you no longer have scientific interest in this pipeline.

11:48.000 --> 11:49.000
But this pipeline is a very nice pipeline.

11:49.000 --> 11:51.000
And it's useful for a community.

11:51.000 --> 11:55.000
And what happens in nf-core is that, as you can see with the example of this...

11:55.000 --> 11:59.000
This smrnaseq pipeline, is that the people that started the pipeline

11:59.000 --> 12:01.000
at SciLifeLab, at some point

12:01.000 --> 12:03.000
they even stopped working on it.

12:03.000 --> 12:06.000
But other groups...

12:06.000 --> 12:08.000
Both from industry and from...

12:08.000 --> 12:11.000
Academia were already contributing to it.

12:11.000 --> 12:12.000
And they just took over.

12:12.000 --> 12:14.000
So this means that you...

12:14.000 --> 12:16.000
That your pipeline lasts...

12:17.000 --> 12:18.000
Not forever.

12:18.000 --> 12:20.000
I would say that longer than if you are in...

12:20.000 --> 12:22.000
Isolated in your own lab.

12:22.000 --> 12:24.000
And here you have all the contributors.

12:24.000 --> 12:25.000
that have contributed to the nf-core

12:25.000 --> 12:26.000
proteinfold pipeline.

12:26.000 --> 12:28.000
From the CRG in Barcelona.

12:28.000 --> 12:29.000
Also UPF in Barcelona.

12:29.000 --> 12:33.000
We are actively working with people in Sydney.

12:33.000 --> 12:34.000
At UNSW.

12:34.000 --> 12:35.000
UNSW.

12:35.000 --> 12:36.000
And also from Australia.

12:36.000 --> 12:37.000
Come on.

12:37.000 --> 12:38.000
And then your university.

12:38.000 --> 12:40.000
University of Korea.

12:40.000 --> 12:43.000
So if you want to join us, here you have...

12:43.000 --> 12:44.000
several of the ways

12:44.000 --> 12:45.000
How you can do it.

12:45.000 --> 12:47.000
So you can join nf-core.

12:47.000 --> 12:49.000
There are several ways to do it.

12:49.000 --> 12:53.000
You can also join the Slack, which is the main communication channel of nf-core.

12:53.000 --> 12:58.000
And there we have the proteinfold channel, which is for the users of the pipeline.

12:58.000 --> 13:01.000
And the proteinfold dev channel, which is for the developers of the pipeline.

13:01.000 --> 13:03.000
And you are very welcome to come here.

13:03.000 --> 13:05.000
And also the GitHub repository.

13:05.000 --> 13:09.000
We have a paper in preparation that we want to submit soon.

13:10.000 --> 13:19.000
More importantly, all the tools that I have shown you are not yet in the release version of the pipeline.

13:19.000 --> 13:20.000
They are still in development.

13:20.000 --> 13:25.000
But we are planning this release. I think in two weeks it should be ready.

13:25.000 --> 13:27.000
I wanted it to be ready for this presentation.

13:27.000 --> 13:28.000
But it was not.

13:28.000 --> 13:32.000
And here you have other things that we want to optimize on the pipeline.

13:32.000 --> 13:34.000
But you can take a look.

13:34.000 --> 13:36.000
And maybe we can discuss afterwards.

13:36.000 --> 13:40.000
And finally, I just want to finish by thanking my supervisor,

13:40.000 --> 13:42.000
Cedric Notredame, the people of my lab,

13:42.000 --> 13:47.000
especially Athanasios Baltzis, who was the original developer of the pipeline.

13:47.000 --> 13:51.000
Also the people in Australia, marketing from Korea.

13:51.000 --> 13:56.000
The people from Seqera, both the nf-core and the Nextflow communities.

13:56.000 --> 13:59.000
And also the AWS.

13:59.000 --> 14:01.000
Open Data Sponsorship Program,

14:01.000 --> 14:05.000
which is allowing us to host all the databases.

14:06.000 --> 14:08.000
So thank you very much.

14:08.000 --> 14:10.000
And I am open to questions.

14:10.000 --> 14:12.000
Thank you very much.