WEBVTT

00:00.000 --> 00:20.000
It's currently estimated that it takes about 50 terawatt hours of electricity to run the world's AI data centers, with estimates that this is going to go up to

00:20.000 --> 00:29.000
400 terawatt hours by 2030. As well as the cost of running these data centers, we have the cost of getting data to the data centers.

00:29.000 --> 00:41.000
It is also often said that it takes about 0.03 kilowatt hours of electricity to transfer just one gigabyte of data, and if you're transferring data you have a cost in time as well.

00:42.000 --> 00:59.000
For many applications these trade-offs are just not acceptable, and a great deal of what we are talking about today is how you own your own AI and how you can do things locally yourselves.

00:59.000 --> 01:11.000
While there's a lot of very good tooling and infrastructure to do things locally and to do things yourselves, for some particular use cases the infrastructure is still nascent and underdeveloped.

01:11.000 --> 01:18.000
And one particular use case that is of interest to us is the microcontroller class of processors.

01:18.000 --> 01:26.000
Where if you want to run things on a microcontroller, which doesn't necessarily have an operating system or a file system,

01:26.000 --> 01:38.000
it is possible at the moment, but it is certainly difficult, especially if you're trying, with your personal project or with your business, to do something interesting and novel.

01:38.000 --> 01:49.000
So as we've been introduced, I'm Dr. William Jones, and I'm here with my colleagues James Lattery and Pietro Ferroa.

01:49.000 --> 02:04.000
And our goal today is to show you that, while the tooling for this microcontroller class of processors is nascent and underdeveloped, it is very possible to bring up any AI project that you're doing, personal or business,

02:04.000 --> 02:21.000
with a good AI modeling framework, using only free and open source tooling. In our case we will be talking through a project we have done with a novel RISC-V processor with a custom accelerator, bringing up the ExecuTorch framework.

02:21.000 --> 02:25.000
And Pietro is going to start our talk about that.

02:34.000 --> 02:47.000
Okay, can you guys hear me well?

02:47.000 --> 02:53.000
Okay, so what is ExecuTorch?

02:53.000 --> 03:02.000
So, at a high level, AI inference follows a predictable path, as most people may already know.

03:02.000 --> 03:06.000
A neural network is represented as a graph.

03:06.000 --> 03:20.000
So here we have a graph evaluator, which traverses the graph performing tensor assignment and operation evaluation, which means allocating memory and doing the maths.
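
To make that concrete, a graph evaluator can be sketched in a few lines of Python. This is an illustrative toy, not ExecuTorch internals; the node format and the two operations are invented for the example.

```python
# Toy graph evaluator: walk the nodes in topological order, allocate an
# output value for each node, and compute it from its inputs.
def evaluate_graph(nodes, inputs):
    """nodes: list of (name, op, input_names) in topological order."""
    values = dict(inputs)  # tensor-name -> value (plain Python lists here)
    for name, op, input_names in nodes:
        args = [values[i] for i in input_names]
        values[name] = op(*args)  # "allocate memory and do the maths"
    return values

# A tiny two-node graph: y = relu(x * 2)
graph = [
    ("double", lambda x: [v * 2 for v in x], ["x"]),
    ("relu",   lambda x: [max(v, 0.0) for v in x], ["double"]),
]
result = evaluate_graph(graph, {"x": [-1.0, 3.0]})
print(result["relu"])  # [0.0, 6.0]
```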

03:20.000 --> 03:28.000
And then the dispatcher sends each task to the best hardware: the CPU or a specialized accelerator.

03:28.000 --> 03:33.000
Crucially, before ExecuTorch runs the program,

03:33.000 --> 03:45.000
we can apply graph-level transformations to simplify or fuse the graph, making the model leaner before it even reaches the chip.
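
A graph-level fusion pass can be sketched like this. It is a hypothetical toy over an invented node format, not ExecuTorch's actual transformation machinery; it only shows the general idea of collapsing adjacent operations into one fused operation.

```python
# Toy fusion pass: collapse adjacent ("conv", "relu") node pairs into a
# single "conv_relu" node, so the runtime dispatches one kernel, not two.
def fuse_conv_relu(nodes):
    fused, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes)
                and nodes[i][0] == "conv"
                and nodes[i + 1][0] == "relu"):
            # Keep conv's attributes, relabel as the fused op.
            fused.append(("conv_relu",) + nodes[i][1:])
            i += 2
        else:
            fused.append(nodes[i])
            i += 1
    return fused

model = [("conv", {"k": 3}), ("relu", {}), ("linear", {})]
print(fuse_conv_relu(model))  # [('conv_relu', {'k': 3}), ('linear', {})]
```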

03:45.000 --> 03:48.000
So why do we need a new tool?

03:48.000 --> 03:52.000
So, standard PyTorch is massive.

03:52.000 --> 03:59.000
It's a Python-based framework built for both training and inference,

03:59.000 --> 04:04.000
which makes it too heavy for embedded systems.

04:04.000 --> 04:12.000
ExecuTorch, however, is its lightweight sibling, handling inference only.

04:13.000 --> 04:24.000
And our goal is to take a model trained in the flexible PyTorch environment and harden it to run on tiny resource-constrained devices.

04:24.000 --> 04:29.000
ExecuTorch follows a build-then-customize approach.

04:29.000 --> 04:33.000
So the build phase is handled by the ahead-of-time compiler.

04:33.000 --> 04:41.000
On your desktop, it does all the heavy lifting, simplifying the logic and stripping away Python so the chip doesn't have to.

04:42.000 --> 04:54.000
And then we have the customize phase, which is what lives on the hardware.

04:54.000 --> 05:03.000
And to bridge the gap between the two, we use a backend delegate during the AOT phase, the ahead-of-time phase.

05:03.000 --> 05:18.000
We tell ExecuTorch: don't run these ops on the CPU, package them for a specialized accelerator, allowing us to plug the hardware's unique capabilities into the ExecuTorch world.

05:18.000 --> 05:27.000
So, the build pipeline: the journey from research to production follows a linear path.

05:27.000 --> 05:35.000
We have the PyTorch model, which is the starting point, full of Python overhead.

05:35.000 --> 05:43.000
Then we have the AOT export, which strips away all the Python and serializes the logic into a .pte file.

05:43.000 --> 05:51.000
The .pte file is the universal language for ExecuTorch: it contains the graph, the weights, and the instructions.

05:51.000 --> 06:01.000
Then we have the runtime, which is the lightweight engine on the device that reads the .pte file, allocates the memory, and executes.

06:01.000 --> 06:09.000
And the result of all of this is fast inference on devices that don't even have an operating system.

06:09.000 --> 06:13.000
We have some constraints.

06:13.000 --> 06:23.000
We applied all of this to a RISC-V-based platform in a bare-metal environment, which means the resource constraints are very absolute.

06:23.000 --> 06:37.000
So memory is a luxury: we work with megabytes, not gigabytes, and by using the AOT phase to pre-calculate memory offsets, we save the chip from having to figure them out at runtime.
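
That offset pre-calculation can be sketched like this. The alignment value and tensor sizes are assumptions for illustration; a real AOT memory planner also tracks tensor lifetimes so buffers can be reused, which this toy version ignores.

```python
# Toy ahead-of-time memory planner: given each tensor's size in bytes,
# assign a fixed offset into one static arena, so the device never has
# to allocate anything at runtime.
def plan_offsets(tensor_sizes, align=16):
    offsets, cursor = {}, 0
    for name, size in tensor_sizes:
        offsets[name] = cursor
        # Round each allocation up to the alignment boundary.
        cursor += (size + align - 1) // align * align
    return offsets, cursor  # cursor is the total arena size

offsets, arena = plan_offsets([("input", 100), ("act0", 64), ("out", 10)])
print(offsets, arena)  # {'input': 0, 'act0': 112, 'out': 176} 192
```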

06:37.000 --> 06:41.000
We have few cores, few accelerators, and most importantly, we don't have an OS.

06:41.000 --> 06:47.000
So there's no Linux, no file system, and no dynamic linker.

06:47.000 --> 06:53.000
So yeah, the ExecuTorch runtime is statically linked directly into the binary.

06:53.000 --> 07:01.000
And because it's modular and dependency-free, we can run the .pte model directly on the metal,

07:01.000 --> 07:11.000
turning a specialized chip into a dedicated AI engine with only minor changes.

07:11.000 --> 07:20.000
When it comes to customizing the performance, we have two main methods. First, we have drop-in replacement:

07:20.000 --> 07:30.000
if you have a hand-tuned kernel for a standard operation, for example a convolution, you can simply swap the default version for the optimized one at build time.
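
A drop-in kernel replacement can be sketched as a registry swap. The registry and kernel names here are hypothetical; in ExecuTorch the equivalent choice is made by which kernel implementations you register and link at build time.

```python
# Toy kernel registry: operator name -> implementation. Swapping one
# entry replaces the default kernel everywhere it is dispatched.
def default_conv(x, w):
    # Portable reference 1-D convolution (valid padding).
    n = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n)]

KERNELS = {"conv": default_conv}

def optimized_conv(x, w):
    # Stand-in for a hand-tuned kernel; same contract as the default.
    return default_conv(x, w)

KERNELS["conv"] = optimized_conv  # the swap: one line at build time

print(KERNELS["conv"]([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]
```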

07:30.000 --> 07:41.000
And we also have more advanced optimizations, for example, graph level optimizations that can fuse layers together.

07:41.000 --> 07:53.000
And by the end of this pipeline, we move from a generic model to a highly tuned system where software and hardware act as a single cohesive unit.

07:53.000 --> 08:01.000
And now I'm going to pass over to my colleague Shane, who will go into more detail about the optimizations we have worked on.

08:01.000 --> 08:03.000
Thanks, Pietro.

08:03.000 --> 08:10.000
So yeah, as Pietro said, we've been working on a RISC-V processor with a custom NPU.

08:10.000 --> 08:22.000
And we've employed a number of optimization strategies in order to get the model working as fast as possible.

08:22.000 --> 08:37.000
So the baseline ExecuTorch essentially has the operators all built in from PyTorch.

08:37.000 --> 08:46.000
So anything you bring down from PyTorch to ExecuTorch, it has an implementation for most of the main operators.

08:47.000 --> 08:52.000
But obviously, you want to optimise this.

08:52.000 --> 09:04.000
So the baseline here is where a single-core CPU takes the tensors and works on them.

09:04.000 --> 09:14.000
So the first optimization strategy we employed was tiling.

09:14.000 --> 09:30.000
So tiling allows for breaking up the problem, essentially making it so that you can do things concurrently.

09:30.000 --> 09:41.000
And this has been pretty instrumental in terms of getting the operators to work as fast as possible.

09:41.000 --> 09:47.000
The next optimization strategy we worked on is multi-threading.

09:47.000 --> 09:53.000
And this again, then allows you to take the tiles you've just done.

09:54.000 --> 10:13.000
and split these across multiple cores on your CPU or on your cluster, to allow for, again, working as fast as possible with the resources you have.
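
Splitting tiles across workers can be sketched with the standard library. This is purely illustrative: on the bare-metal target there is no Python, and `ThreadPoolExecutor` merely stands in for the cores in the cluster; the per-tile compute is a stand-in too.

```python
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile):
    return sum(tile)  # stand-in for the real per-tile computation

data = list(range(16))
tiles = [data[i:i + 4] for i in range(0, len(data), 4)]  # 4 tiles of 4

# One tile per worker; map preserves tile order in the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_tile, tiles))

print(partials, sum(partials))  # [6, 22, 38, 54] 120
```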

10:13.000 --> 10:23.000
So a lot of the tensors in PyTorch and ExecuTorch are all float32, which is 32 bits.

10:23.000 --> 10:30.000
And these are expensive to work with, especially on an embedded platform.

10:30.000 --> 10:37.000
So the other optimization strategy we have worked on is quantization.

10:37.000 --> 10:58.000
And this process is essentially where you take the 32-bit floats and transform them down into 8-bit integers for more optimal and faster performance.

10:58.000 --> 11:04.000
So the first optimization is in regards to memory.

11:04.000 --> 11:15.000
This is where we have been using the multiple cores with, specifically, the L1 and L2 memory.

11:15.000 --> 11:27.000
So, through the use of a DMA, we can break the tensors down into tiles and bring these into L1 memory for faster performance.

11:27.000 --> 11:35.000
And yeah, to work on things quicker.

11:35.000 --> 11:49.000
So the tensors generally tend to live in L2 memory, because this is where you have the most memory,

11:50.000 --> 11:56.000
As your L1 tends to be pretty small.

11:56.000 --> 12:08.000
And with the use of the DMA, you're able to then break these into tiles and use the tiling algorithms that we have been working on.

12:08.000 --> 12:26.000
So, for the likes of an image convolution, this lets you prefetch, breaking the work down into sub-tiles.

12:26.000 --> 12:41.000
So, as you can see, this is a pretty straightforward algorithm where, as you are computing on tile n, you're able to then use a different core to load tile n plus 1.

12:41.000 --> 12:47.000
So, in that sense, you are essentially double buffering.

12:48.000 --> 12:57.000
And with the use of the DMA, this lets you quickly swap between memory as needed.

12:58.000 --> 13:10.000
So, as you're computing tile n, once you have your results, you can store them into the output tensors.

13:10.000 --> 13:18.000
And then you can compute on your next buffer straight away. And this essentially loops around.

13:18.000 --> 13:31.000
You can load the next tile as you're waiting for the computation of the previous one.
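
That loop can be sketched as follows. This sequential toy shows only the buffer alternation; on the real hardware the DMA load of tile n+1 overlaps the compute on tile n rather than running before it, and the load and compute functions here are stand-ins.

```python
# Toy double-buffering loop: two buffers alternate by index parity, so
# the "next" tile is staged while the "current" one is consumed.
def run_double_buffered(tiles, load, compute):
    buffers = [None, None]
    buffers[0] = load(tiles[0])  # prime the first buffer
    results = []
    for n in range(len(tiles)):
        if n + 1 < len(tiles):
            buffers[(n + 1) % 2] = load(tiles[n + 1])  # prefetch next
        results.append(compute(buffers[n % 2]))        # compute current
    return results

tiles = [[1, 2], [3, 4], [5, 6]]
out = run_double_buffered(tiles, load=list, compute=sum)
print(out)  # [3, 7, 11]
```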

13:31.000 --> 13:39.000
The next optimization that we employed is quantization.

13:39.000 --> 13:53.000
And as I previously mentioned, this allows you to take the 32-bit weights and biases and other tensors that you may have.

13:53.000 --> 14:00.000
And break them down into 8-bit integers.

14:00.000 --> 14:08.000
This is a pretty core optimization that is very necessary for these embedded platforms.

14:08.000 --> 14:19.000
Because essentially, working with 8-bit integers just tends to be much faster than using floats.

14:19.000 --> 14:40.000
So, you can calculate the scaling parameter needed and an offset, which let you take the 32-bit floats and roughly map them to 8-bit integers.

14:40.000 --> 14:58.000
And this can then be used post-training: you feed it into your operator and get results out much faster.
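
The scale-and-offset calculation described here is the standard affine quantization scheme, which can be sketched as follows. The float range and the int8 qmin/qmax are the usual textbook choices, not the speakers' exact parameters.

```python
# Post-training affine quantization: derive a scale and zero-point from
# the observed float range, then map float32 values roughly onto int8.
def quant_params(fmin, fmax, qmin=-128, qmax=127):
    scale = (fmax - fmin) / (qmax - qmin)
    zero_point = round(qmin - fmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Round to the nearest integer step and clamp into the int8 range.
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

scale, zp = quant_params(-1.0, 1.0)
q = quantize([0.0, 0.5, -1.0], scale, zp)
approx = dequantize(q, scale, zp)  # roughly recovers the original floats
```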

14:58.000 --> 15:06.000
So, based on these optimizations, these are the benefits we've seen so far.

15:07.000 --> 15:12.000
This is ongoing progress, so there's other optimizations that you can make.

15:12.000 --> 15:25.000
But with the current optimizations I've already outlined, as you can see, there's a roughly 3.5x performance benefit with a convolution,

15:25.000 --> 15:29.000
through the use of tiling and acceleration.

15:29.000 --> 15:49.000
And with softmax, there is at least a 2x benefit, but we've seen higher benefits with bigger tensor sizes, where the green is the baseline, which is the operators

15:49.000 --> 15:58.000
that ExecuTorch implements, and the blue is our versions of the operators.

15:58.000 --> 16:10.000
So, we'd like to thank the team at Mosaic SCC, who supported our work with this processor.

16:10.000 --> 16:29.000
And yes, so this allows us to take those big PyTorch models that are really not energy efficient and make them energy efficient on small embedded devices.

16:29.000 --> 16:31.000
Thank you.

16:31.000 --> 16:39.000
And I think we have five minutes for questions.

16:39.000 --> 17:06.000
Yes, if anyone.

17:06.000 --> 17:18.000
So, with great quantization comes great loss of accuracy, so I wonder what you can run so far.

17:18.000 --> 17:35.000
Quantizing at eight bits, we have had similar issues, not being able to run sound-related models, so I wonder what your numbers are in this case.

17:37.000 --> 17:50.000
We're still developing this, it's fairly early on. Our numbers look good, but we can't conclusively say that we've solved the problems that everybody else is struggling with yet.

17:50.000 --> 17:54.000
Sorry, I don't have anything more concrete than that.

17:54.000 --> 18:03.000
Something to add on that, while you find the next question:

18:03.000 --> 18:10.000
in this convolution graph here, where we're comparing data, this is quantized convolution versus quantized convolution.

18:10.000 --> 18:15.000
So, if we were doing quantized versus unquantized, it would be expected that we'd see a fourfold increase

18:15.000 --> 18:18.000
at best, because we're going from 32 bits down to 8.

18:18.000 --> 18:26.000
Quantized versus quantized is a much closer comparison.

18:26.000 --> 18:35.000
[inaudible question]

18:36.000 --> 18:43.000
So, have you looked at scaling this up to larger RISC-V systems instead?

18:43.000 --> 18:57.000
We've explored that; we wouldn't use the ExecuTorch framework like this for larger RISC-V systems, because it's specifically aimed at the microcontroller class of processors.

18:57.000 --> 19:02.000
But we have before looked at doing things

19:02.000 --> 19:04.000
At a much larger scale.

19:04.000 --> 19:08.000
Um, and we, we would use a similar approach.

19:08.000 --> 19:13.000
We would use normal PyTorch to do it, I suppose, is where we'd go with that.

19:13.000 --> 19:19.000
I think the system we looked at, when we were looking at doing this at a larger scale, had something like a thousand cores overall.

19:19.000 --> 19:21.000
So, yeah.

19:31.000 --> 19:44.000
So, the question was about TF Lite.

19:44.000 --> 19:56.000
So, the question is: with TensorFlow, you can convert all the operators into TensorFlow Lite.

19:56.000 --> 20:02.000
When using TensorFlow Lite, is this the same with PyTorch to ExecuTorch?

20:02.000 --> 20:11.000
For the most part, yes, it's very capable of handling as many operators as you need, really.

20:11.000 --> 20:23.000
There are limitations; specifically, ours were operators that you'd use during training rather than inference, but I believe TF Lite tends to have those exact same issues.

20:23.000 --> 20:29.000
And it's still early days for ExecuTorch; I believe it's only a few years old.

20:29.000 --> 20:48.000
So, as PyTorch develops and as ExecuTorch is developing, you're getting more and more optimizations, more and more operators, and yeah.

20:49.000 --> 20:59.000
Something I'd say on that as well is that the way ExecuTorch does support for this is it maps the very, very large set of all PyTorch operators down into a sort of smaller core set.

20:59.000 --> 21:04.000
And it's, as Shane said, it's still early stages. It's pretty good at reducing things.

21:04.000 --> 21:09.000
Mapping things from this big set of operators down to the smaller set and in theory, everything should be supported.

21:09.000 --> 21:14.000
Our experience is, it's a little rough around the edges because it's still being actively developed.

