WEBVTT

00:00.000 --> 00:11.720
We're at the last talk of the day to wrap up, and this is a fantastic thing, because now

00:11.720 --> 00:21.800
he will tell us about an actual application of AI and what he's doing with Tenstorrent hardware

00:21.800 --> 00:29.920
for drug discovery, and yeah, I've just learned that it was

00:29.920 --> 00:36.000
a hobby project last year, and now he's working at Tenstorrent to continue his work,

00:36.000 --> 00:41.800
so a hobby project turned into a job, and a good one driven by passion, which is awesome,

00:41.800 --> 00:48.600
so let's listen to what you want to say about the progress of your project.

00:48.600 --> 00:51.000
Awesome, thanks.

00:51.000 --> 00:55.120
So yeah, I'm going to give a little update on my project, TT-Boltz.

00:55.120 --> 00:58.680
I actually already gave a talk about TT-Boltz at the second edition of the conference,

00:58.680 --> 01:03.960
but that was the Boltz-1 port running on Tenstorrent Wormhole,

01:03.960 --> 01:08.200
and really a lot has happened since then, so I thought it would be cool to give

01:08.200 --> 01:15.240
a little update. So for people who are not already familiar with the project: TT-Boltz

01:15.240 --> 01:22.200
is basically a port of Boltz-2 to Tenstorrent hardware, and Boltz-2 is a biomolecular structure

01:22.280 --> 01:29.640
prediction model, basically an improved successor of AlphaFold 3. I think

01:29.640 --> 01:34.040
more people are familiar with AlphaFold than with Boltz, but Boltz has actually become the industry

01:34.040 --> 01:39.000
standard, and so biotech labs use Boltz and not AlphaFold anymore.

01:39.000 --> 01:44.680
And yeah, the project is completely open source under an MIT license, and it works out of

01:44.680 --> 01:48.280
the box on Tenstorrent hardware.

01:48.280 --> 01:54.600
I started the project in November 2024, just as a pet project, a side project during

01:54.600 --> 02:02.920
university, and then in April 2025 I had the first working prototype, and since then I have just been

02:02.920 --> 02:09.320
optimizing the model. After I finished my studies, I joined Tenstorrent, and I get a lot of

02:09.320 --> 02:15.880
time to work on the project and more resources, which is awesome. I also wrote my thesis about

02:15.880 --> 02:20.920
the topic, so if anyone is interested, you can also check out the thesis report.

02:26.760 --> 02:29.640
Yeah, but first I want to talk a little bit about Tenstorrent hardware.

02:29.640 --> 02:33.960
I want to keep it really brief, because there were already so many talks about Tenstorrent hardware.

02:35.080 --> 02:40.760
So here we have the Tenstorrent Blackhole p150, which is currently the most performant PCI

02:40.760 --> 02:45.720
Express card that is available. In the middle we have the Blackhole

02:45.720 --> 02:52.760
processor, then eight GDDR6 memory banks with 4 gigabytes each, so 32 gigabytes in total. But for me

02:52.760 --> 02:59.080
the coolest thing is that it has a large SRAM, 210 megabytes of SRAM, and it's a scratch-

02:59.080 --> 03:06.040
pad, so we can control it explicitly and stay in SRAM for a long time for certain operations,

03:06.920 --> 03:13.640
which can be a major advantage in comparison with GPUs, which have transparent

03:13.640 --> 03:19.480
caches that you can only control implicitly. On the right is my QuietBox, and yeah,

03:19.480 --> 03:23.080
the Boltz project right now is only running on a single card, but very soon we want to

03:23.080 --> 03:27.400
parallelize it, of course, across multiple cards. There's also a Tenstorrent Galaxy server with

03:27.400 --> 03:33.720
32 processors, and yeah, I want to leverage all of that for drug discovery in the

03:33.880 --> 03:41.640
Boltz project. Yeah, here's the architecture of the Blackhole processor and the Tensix

03:41.640 --> 03:47.080
core. So for the Blackhole processor, the architecture is basically just a grid of cores: we have

03:47.720 --> 03:54.360
Ethernet cores, DRAM cores, a system controller, but where the compute actually happens

03:54.360 --> 04:00.040
is in the Tensix cores. On the right we have the architecture of a Tensix core, and we have

04:00.120 --> 04:06.360
five small RISC-V cores, and they basically control the routers, the data movement, and the

04:06.360 --> 04:13.480
compute engines, and each Tensix core also has, I think, 1.5 megabytes of SRAM on Blackhole.

04:15.880 --> 04:22.440
yeah, but I think not everyone here is familiar with structure prediction,

04:23.640 --> 04:29.400
So: proteins are basically the building blocks of life, they are involved in all processes

04:29.640 --> 04:36.200
of life, and they are defined by a sequence of amino acids. The sequence of amino acids

04:36.200 --> 04:43.240
folds into a 3D structure in the human body, and the 3D structure determines the function,

04:43.240 --> 04:50.040
what the protein does in the human body. So it is really important to predict the structure,

04:50.040 --> 04:57.720
but it was a huge open problem in biology for 50 years. AlphaFold 2 basically solved

04:57.880 --> 05:04.360
this problem and was therefore awarded the Nobel Prize in 2024, and it can predict those structures

05:04.360 --> 05:15.160
in minutes and not years. So, Boltz-2 builds on top of AlphaFold, but it also adds affinity

05:15.160 --> 05:22.920
prediction, there in the top right corner. Affinity prediction basically predicts the probability

05:22.920 --> 05:28.840
and strength with which a small molecule binds to a protein. For example, a protein could be a harmful

05:28.840 --> 05:35.080
protein, and then you want to design a drug, so a small molecule, the ligand here, that attaches

05:35.080 --> 05:40.040
to the protein and maybe disables its function, and that's why binding affinity is so

05:40.920 --> 05:48.040
important for drug discovery. And yeah, in contrast to AlphaFold, the Boltz model is fully open

05:48.040 --> 05:53.640
source and not only for academic usage. And on the bottom we have the whole architecture of the

05:53.640 --> 05:58.760
Boltz-2 model. It can seem pretty complex, but the most important modules are really just

05:58.760 --> 06:07.880
the Pairformer in the middle, the diffusion module, and maybe also the MSA module. So formally

06:07.880 --> 06:15.000
it is basically just a diffusion model, but it is a pretty unique diffusion model because most

06:15.080 --> 06:21.800
of the computational complexity is in this Pairformer, which is a transformer, and yeah, that's what

06:21.800 --> 06:29.800
makes it pretty different from other diffusion models. This is another way of

06:29.800 --> 06:37.160
looking at the architecture, and there we can see that Boltz-2 basically consists of a small

06:37.160 --> 06:42.680
number of big modules, which are composed of smaller modules. The main big modules are the

06:42.680 --> 06:48.040
Pairformer, the diffusion module, and the MSA module. We can see, for example, that the template module,

06:48.040 --> 06:54.440
the confidence module, and the affinity module are just thin wrappers around the Pairformer, and so

06:54.440 --> 07:01.560
the task of porting the Boltz model to Tenstorrent hardware was just porting those big modules

07:01.560 --> 07:07.880
to the Tenstorrent architecture. And yeah, I would say the most interesting small modules are

07:08.760 --> 07:15.720
triangle multiplication and triangle attention, and they have cubic complexity, not just

07:15.720 --> 07:22.600
quadratic complexity, in contrast, for example, to regular self-attention. The triangle

07:22.600 --> 07:29.080
comes from the fact that between those atoms the triangle inequality holds, and that is just

07:29.080 --> 07:35.080
an inductive bias. That's basically what the researchers did with AlphaFold and Boltz: it feels

07:35.080 --> 07:40.760
like engineering inductive biases, so they take regular machine learning operations, but they have

07:40.760 --> 07:46.520
some requirements, for example from physics, and they have to engineer those inductive biases

07:47.560 --> 07:55.960
into the model. Another interesting thing about the Boltz-2 model is the activations. So

07:55.960 --> 08:01.320
we have a single representation; this is the usual representation that you know from transformers:

08:01.320 --> 08:07.480
in a transformer, we have a sequence of tokens and therefore a sequence of embeddings, and therefore

08:07.480 --> 08:13.880
this matrix on the left. But additionally, in the Boltz-2 model, we have a pair representation,

08:13.880 --> 08:20.200
and this pair representation has an additional dimension, which is as large as the sequence length,

08:21.000 --> 08:27.000
and this is the root cause of the increased complexity. This means, for example, that

08:27.160 --> 08:33.240
the memory complexity of the activations is quadratic in the sequence length, and so for larger

08:33.240 --> 08:38.040
molecules, most of the memory is consumed by the activations and not the weights.

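NOTE
To make the quadratic activation memory concrete, here is a small back-of-the-envelope sketch in Python. The pair-channel width and the BF16 element size are illustrative assumptions, not the exact Boltz-2 configuration.
# Illustrative only: pair-representation memory scales as N^2 in the number of
# amino acids N, while the scores inside triangle attention scale as N^3.
def pair_memory_mb(n_tokens: int, channels: int = 128, bytes_per_elem: int = 2) -> float:
    """Memory of one (N, N, C) pair-representation tensor in megabytes."""
    return n_tokens * n_tokens * channels * bytes_per_elem / 1e6
for n in (256, 512, 1024, 2048):
    print(f"N={n:5d}: pair activation ~{pair_memory_mb(n):8.1f} MB")
# N=1024 already gives ~268 MB for a single BF16 pair tensor, so for larger
# molecules the activations, not the weights, dominate memory.
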
08:40.040 --> 08:44.840
The Boltz-2 model is actually a pretty small model, but the activations can get really large,

08:44.840 --> 08:49.080
and this additional dimension, which is as large as the sequence length,

08:49.080 --> 08:55.080
is also the root cause of the cubic time complexity of triangle attention

08:55.160 --> 09:03.320
and triangle multiplication. And yeah, this is basically the skeleton of how I

09:04.120 --> 09:09.960
ported Boltz to Tenstorrent hardware. Those are just two of the big modules, but I basically

09:09.960 --> 09:15.800
rewrote them in TTNN and then put them in a PyTorch wrapper, so from the outside they look like

09:15.800 --> 09:25.000
PyTorch modules, and then I could use them in the existing code base.

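NOTE
A minimal sketch of that wrapping pattern, assuming the ttnn Python API (ttnn.from_torch, ttnn.to_torch, ttnn.linear); the module name and shapes are illustrative, not the actual TT-Boltz code.
import torch
import ttnn
class TTLinearBlock(torch.nn.Module):
    # Toy stand-in for one ported module: TTNN inside, PyTorch interface outside.
    def __init__(self, weight: torch.Tensor, bias: torch.Tensor, device):
        super().__init__()
        self.device = device
        # Move the weights to the Tenstorrent device once, in tile layout.
        # Weight is stored as (in_features, out_features) for ttnn.linear.
        self.weight = ttnn.from_torch(weight, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
        self.bias = ttnn.from_torch(bias, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # From the outside this behaves like a regular torch.nn.Module: torch in, torch out.
        x_tt = ttnn.from_torch(x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=self.device)
        y_tt = ttnn.linear(x_tt, self.weight, bias=self.bias)
        return ttnn.to_torch(y_tt)
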
09:25.000 --> 09:29.480
And then most of my time was actually spent on performance optimizations, and the key performance optimizations

09:30.280 --> 09:36.360
that I did across all those modules were: for triangle attention, I integrated FlashAttention-2, and

09:36.360 --> 09:42.520
I will definitely go into that later. Then of course, like always, fusing operations, for example,

09:42.520 --> 09:47.800
if you compute queries, keys, and values, you have three linear transformation, but you can

09:47.800 --> 09:52.360
fuse them into one, then you have a bigger matrix multiplication, that means higher arithmetic

09:52.360 --> 09:57.960
intensity, and so you can utilize the matrix units more, but in general, just fusing operations,

09:58.840 --> 10:04.920
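NOTE
A hedged illustration of that QKV fusion in plain PyTorch (not the TTNN kernels; the dimensions are made up): three projections become one larger matmul with higher arithmetic intensity.
import torch
d_model, n = 256, 128
x = torch.randn(n, d_model)
# Unfused: three separate linear transformations, i.e. three smaller matmuls.
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv
# Fused: one (d_model, 3*d_model) weight, one bigger matmul, then split the result.
w_qkv = torch.cat([wq, wk, wv], dim=1)
q2, k2, v2 = (x @ w_qkv).chunk(3, dim=-1)
assert torch.allclose(q, q2, atol=1e-4) and torch.allclose(v, v2, atol=1e-4)
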
having less kernel overhead staying long and SRAM, there was important. And since the

10:04.920 --> 10:10.280
activations have high memory complexity, I use chunking a lot, so processing those activations,

10:10.360 --> 10:15.960
those tensors, in chunks that we can also keep in SRAM; that was really important.

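NOTE
A minimal sketch of the chunking idea in plain PyTorch (the chunk size and the operation being chunked are illustrative): process the pair tensor in slices small enough to stay in SRAM instead of materializing everything at once.
import torch
def process_in_chunks(pair: torch.Tensor, fn, chunk_size: int = 128) -> torch.Tensor:
    # Apply `fn` to (chunk, N, C) slices of an (N, N, C) pair tensor and reassemble.
    outputs = [fn(chunk) for chunk in pair.split(chunk_size, dim=0)]
    return torch.cat(outputs, dim=0)
pair = torch.randn(512, 512, 64)
out = process_in_chunks(pair, lambda t: torch.nn.functional.layer_norm(t, t.shape[-1:]))
assert out.shape == pair.shape
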
10:16.520 --> 10:23.480
And of course, mixed precision: by default we use BFloat16, but whenever we can, whenever it doesn't affect

10:23.480 --> 10:29.880
the accuracy too much, we use block FP8, and for example, for triangle attention, we can

10:29.880 --> 10:39.880
basically do all of it in block FP8. And because, yeah, because triangle attention is one of the

10:39.960 --> 10:47.720
most interesting small modules, I just wanted to briefly give the idea of how we used regular

10:47.720 --> 10:54.440
FlashAttention-2 for that. We have here the standard attention formula, and in the standard

10:54.440 --> 11:01.400
attention formula, we add an attention mask. In triangle attention, we just have to replace the

11:01.400 --> 11:07.720
attention mask with the triangle bias. And there's a small difference between the

11:07.800 --> 11:13.320
attention mask and the triangle bias. The attention mask is the same for all attention heads.

11:13.320 --> 11:21.240
So we can broadcast it across the head dimension. But the triangle bias is actually different

11:21.240 --> 11:27.560
per head, and we have to broadcast it across the batch dimension. And so the first trick that we did

11:27.560 --> 11:33.960
was just permuting the head dimension and the batch dimension, because they are both independent.

11:33.960 --> 11:40.680
But even those permutations took too long. And so in the end, we just added support for

11:40.680 --> 11:48.280
batch broadcasting to the FlashAttention kernel. And this way, we can just use

11:48.280 --> 11:56.280
FlashAttention-2 for triangle attention.

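NOTE
A minimal sketch of triangle attention expressed as standard attention with an additive bias, softmax(Q K^T / sqrt(d) + bias) V, here written with PyTorch SDPA rather than the actual TTNN FlashAttention-2 kernel; shapes and channel counts are illustrative. The row index of the pair representation plays the role of the batch dimension, so the triangle bias differs per head but is broadcast across that batch dimension, the opposite of a padding mask.
import torch
import torch.nn.functional as F
N, H, D = 64, 4, 32            # amino acids, heads, head dim (made-up sizes)
z = torch.randn(N, N, H * D)   # pair representation, one row per starting node
def heads(t):                  # (N, N, H*D) -> (N, H, N, D): row index acts as batch
    return t.view(N, N, H, D).permute(0, 2, 1, 3)
q, k, v = heads(z), heads(z), heads(z)   # stand-ins for learned projections
# Triangle bias: different per head, shared across the batch (row) dimension.
triangle_bias = torch.randn(1, H, N, N)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=triangle_bias)
out = out.permute(0, 2, 1, 3).reshape(N, N, H * D)
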
11:56.280 --> 12:02.440
And now I can talk a little bit about the evaluation, about the performance. This is, of course, just the baseline: Boltz-2 on a CPU

12:02.520 --> 12:10.520
and Boltz-2 on Blackhole. And yeah, we can see that it is significantly faster than a CPU

12:10.520 --> 12:16.040
for all sequence lengths. And with sequence length, I always mean the number of amino acids

12:16.040 --> 12:23.640
in a protein. More interesting is, of course, the comparison with a GPU. And so I wanted to

12:23.640 --> 12:31.000
compare the best PCI Express cards from NVIDIA and from Tenstorrent that are available for

12:31.880 --> 12:38.680
consumers. But of course, just comparing the inference time would be a little unfair, because

12:39.960 --> 12:47.960
the average price of an RTX 5090 over the last year was around $3,000, and right now it's not even available

12:47.960 --> 12:54.520
for this price. It does turn out to be significantly faster. But if we look at the predicted structures

12:54.600 --> 13:05.640
per hour per dollar, we have better results for the Tenstorrent cards for all proteins except

13:05.640 --> 13:13.720
the one with more than 1,000 amino acids. But I still think that there's a bug for large proteins,

13:13.720 --> 13:19.720
and I hope we can solve it and also get better for larger proteins. So I think Tenstorrent

13:19.720 --> 13:28.920
cards already have very good price-performance, and for small proteins, the Tenstorrent cards,

13:28.920 --> 13:37.560
the Tenstorrent Blackhole cards, are actually faster than the RTX 5090.

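NOTE
To make the price-performance metric explicit: structures per hour per dollar is just throughput divided by card price. The numbers below are placeholders, not the measured results from the slide.
def structures_per_hour_per_dollar(seconds_per_structure: float, card_price_usd: float) -> float:
    return (3600.0 / seconds_per_structure) / card_price_usd
# Hypothetical example: one structure in 120 s on a $1,000 card
# -> 30 structures/hour -> 0.03 structures per hour per dollar.
print(structures_per_hour_per_dollar(120.0, 1000.0))  # 0.03
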
13:37.560 --> 13:43.640
And when optimizing the performance, profiling is of course very important, as is looking at how the runtime is distributed across

13:43.720 --> 13:52.600
all sequence lengths and across all the big modules. And I think the most interesting

13:52.600 --> 14:02.040
thing here is that as the sequence length increases, the Pairformer more and more dominates the

14:02.040 --> 14:07.560
runtime. And that's because only the Pairformer contains triangle attention and triangle

14:07.560 --> 14:14.920
multiplication, which are the only small modules that have cubic complexity. And so for large

14:14.920 --> 14:21.640
proteins, we mostly care about the Pairformer, but for small proteins, it's also very important

14:21.640 --> 14:31.880
to optimize, for example, the diffusion module. And yeah, that's basically what a predicted

14:31.880 --> 14:37.000
structure looks like. This protein structure, the blue one, was predicted on a

14:37.000 --> 14:43.320
Tenstorrent Blackhole card. In green, we have the experimental ground truth. And so yeah,

14:43.320 --> 14:49.480
it is pretty accurate and that's also confirmed by all the important metrics. And I guess we

14:49.480 --> 14:56.120
still have a little bit of time so I can also hopefully show a little demo.

15:07.000 --> 15:14.840
Oh, maybe I have to start it again.

15:37.480 --> 15:44.360
Yeah, so I think it works now. So the first part was generating the MSA. Then it is loading the

15:44.360 --> 15:50.520
model into memory. Then the Pairformer runs, and right now that is the diffusion module.

15:50.520 --> 15:56.840
And you can see how the protein is folded. That was pretty fast. And then you can also

15:56.840 --> 16:02.440
see, for example, for each amino acid here, how confident the model is in the prediction,

16:02.440 --> 16:09.480
the overall confidence, and some metrics that are important for biology.

16:10.600 --> 16:14.280
So I think with that we can open up for a Q&A maybe.

16:33.080 --> 16:41.640
Yeah. On the comparison: is the NVIDIA card running on FP32 and the Tenstorrent card on BF16, or?

16:47.640 --> 16:55.160
Yeah, the NVIDIA cards are also running on BF16, but they don't use block FP8 anywhere.

17:03.320 --> 17:12.040
Okay, good. Thank you very much. This is a wrap. I'll give it to you.

17:13.880 --> 17:16.680
Thank you.

17:25.560 --> 17:30.280
Okay, thank you everybody for joining us today. So if you want to continue this discussion,

17:30.360 --> 17:39.080
you can also join us on Monday at the fringe event. You can also just head to the breweries

17:39.080 --> 17:43.080
and continue the evening. Thank you everyone.

