WEBVTT

00:00.000 --> 00:13.960
Okay, let's quiet the room again, so we're back to long talks, the lightning talks are over.

00:13.960 --> 00:20.520
And the first one will be on Dynamo, so it fits this LLM theme, and actually the first

00:20.520 --> 00:26.040
time I heard the Dynamo talk was a few months ago, so some time has passed, and

00:26.040 --> 00:31.960
the talk has surely evolved, and I'm actually very curious to hear what changed and what's new.

00:31.960 --> 00:38.800
So please welcome Paul, talking about supercharging LLM serving with Dynamo.

00:38.800 --> 00:47.360
Hello, hello everyone. Yes, I'm Paul, I'm a software engineer at NVIDIA, and I mainly

00:47.360 --> 00:55.120
contribute to Dynamo, specifically to the disaggregated serving part, and as a part of that

00:55.120 --> 01:04.200
I also contribute to vLLM, which is one of the inference engines that runs under Dynamo.

01:04.200 --> 01:11.560
So where does Dynamo sit? It sits basically on top of inference frameworks,

01:11.560 --> 01:18.360
inference engines like vLLM, TRT-LLM, llama.cpp and others, and it is a serving

01:18.360 --> 01:28.040
library targeted mainly at large-scale distributed serving, but not exclusively at that.

01:28.040 --> 01:33.160
And it's actually a collection of libraries, so there's the main one, Dynamo, but there are

01:33.160 --> 01:41.400
a lot of tooling libraries around that, and all of those are open source, and we already

01:41.480 --> 01:49.680
have over 200 contributors. And we try, this is maybe not originally in the NVIDIA DNA,

01:49.680 --> 01:57.120
but we try our best to work in an open-source manner as much as possible.

01:57.120 --> 02:05.760
Yeah, so the ultimate goal of Dynamo is to speed up inference and allow large-scale

02:05.840 --> 02:12.080
inference to run as fast as possible, and this is our main objective. This chart is taken

02:12.080 --> 02:18.880
from the SemiAnalysis InferenceMAX benchmark, so basically on the x-axis we have

02:18.880 --> 02:24.240
how many tokens per second per user we can generate, so this is from the perspective of the user,

02:24.240 --> 02:29.360
how fast the tokens are generated for you, and then on the y-axis we have how many overall

02:29.440 --> 02:35.680
tokens per second are produced by our system, so this is basically how cheap the inference is,

02:35.680 --> 02:45.360
right? So we want to be in the top-right, and we have some success in that matter. So maybe I will just

02:45.360 --> 02:53.120
quickly preface what the original motivation to create Dynamo was, and it was to create

02:53.360 --> 03:00.160
an open-source framework for disaggregated serving, and basically, if someone's not familiar,

03:01.040 --> 03:07.840
we have two phases of generation: one is to compute the context, which is the prefill,

03:08.640 --> 03:14.080
and then once we compute the context, the prompt, we generate token by token, and those problems are

03:15.280 --> 03:20.880
computationally very different: one is more compute-bound, the decode is memory-bound,
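
To make the two phases concrete, here is a toy sketch (purely illustrative, not the Dynamo or vLLM API): prefill runs once over the whole prompt and fills the KV cache, while decode re-reads that cache every step to emit one token.

```python
# Toy sketch of the two phases (hypothetical code, not the Dynamo API).

def prefill(prompt_tokens):
    # One pass over all prompt tokens: attention here is O(n^2) in the
    # prompt length, which is why this phase is compute-bound.
    return [("kv", t) for t in prompt_tokens]  # stand-in for real K/V tensors

def decode(kv_cache, steps):
    # Each step attends over the entire cache to produce a single token,
    # so this phase is dominated by reading KV memory (memory-bound).
    out = []
    for _ in range(steps):
        token = len(kv_cache)  # stand-in for the sampled next token
        kv_cache.append(("kv", token))
        out.append(token)
    return out

print(decode(prefill([1, 2, 3]), steps=2))  # [3, 4]
```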

03:21.600 --> 03:28.240
so we want to optimize them separately, so we need to separate them first. That solves some

03:28.240 --> 03:34.720
problems, but creates many, many other problems. One of the points, for example, is that we create

03:34.720 --> 03:40.560
a different configuration for prefill and for decode; they can span

03:40.560 --> 03:46.560
a different number of nodes in the data center, we need to scale them independently, we need to

03:46.560 --> 03:53.520
route requests between them, transfer KV cache between them, and so on, so there's a lot of work to

03:53.520 --> 04:02.160
be done to actually make it worthwhile. And why is it so hard in practice to have an efficient

04:02.160 --> 04:07.840
large-scale system? For example, the benchmark mentioned above was running on

04:07.840 --> 04:16.400
300 GPUs, so we need to be able to route between all of those, we need to transfer

04:16.400 --> 04:24.960
KV cache between all of those, and the throughput on the frontend, on the routers, and so on must

04:24.960 --> 04:32.400
be really high, right? And also, since the models sometimes span two nodes,

04:32.400 --> 04:38.720
sometimes eight or 16 nodes for a single inference engine, fault tolerance also becomes quite a big

04:38.720 --> 04:46.560
problem; we don't want an engine failure to compromise our system. And then another one is that

04:46.560 --> 04:54.080
in practice the load changes over time, right? So that's why a good model is not enough to

04:54.080 --> 04:59.840
have a successful product, a successful deployment; you also need a good system built around it,

05:00.800 --> 05:09.920
so we can, in fact, serve inference that is fast but also cheap. So these are the technologies that

05:09.920 --> 05:19.520
Dynamo uses. Going from the top, it's mainly designed to be deployed on Kubernetes,

05:19.520 --> 05:26.640
not exclusively, it can be extended, but this is our main use case. Then we have the observability layer,

05:27.600 --> 05:32.160
then we have the discovery plane to discover different Dynamo components, which also

05:32.160 --> 05:38.000
runs mainly in Kubernetes. Then the components must communicate, and we have the communication plane

05:38.000 --> 05:44.000
for that, and then we can run the actual inference on the engines, and then we have

05:44.000 --> 05:52.480
other tooling in Dynamo like NIXL and KVBM to enable KV cache transfer and other KV cache manipulation

05:52.560 --> 06:02.800
and optimizations. So these are the main components, and we try for Dynamo to be as modular

06:02.800 --> 06:09.280
as possible. So basically, other than the bottom layers, the discovery and the communication plane,

06:09.280 --> 06:15.520
you could take out any of the other components and it would still work, and you can also implement your own

06:15.520 --> 06:20.960
components, because basically each main component, like prefill, decode, router, frontend,

06:20.960 --> 06:28.800
is basically only an async generator with many inputs and a single output stream,

06:28.800 --> 06:34.720
and that's also the case, for example, for an inference engine like vLLM; it only needs a thin

06:35.680 --> 06:42.320
wrapper on our end, because the engine is an async generator already, right? So we can go through
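
A minimal illustration of that "component = async generator" idea (hypothetical names, not the actual Dynamo interface): a request goes in, and a stream of outputs comes back, just like a streaming engine yielding tokens.

```python
import asyncio

# Hypothetical sketch: a component is just an async generator.

async def echo_worker(request: str):
    for word in request.split():
        yield word  # in a real worker, this would be a generated token

async def collect(request: str):
    # Drain the stream, the way a frontend would forward tokens to a client.
    return [chunk async for chunk in echo_worker(request)]

result = asyncio.run(collect("hello from dynamo"))
print(result)  # ['hello', 'from', 'dynamo']
```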

06:43.360 --> 06:47.200
the components now; maybe some of them will interest you, some others won't,

06:48.160 --> 06:54.480
and some of them, like the planner or the router, could also be used outside of Dynamo, for example in

06:54.480 --> 07:04.400
llm-d, which is another inference framework mainly developed by Red Hat. So yeah, the

07:04.400 --> 07:10.800
discovery plane: it's very simple. When you spawn new components, other components must

07:10.800 --> 07:19.120
be aware of that, and when some component is scaled down or fails, other components also need

07:19.120 --> 07:24.960
to be aware of that, so they don't route to it. For that we simply use Kubernetes, but also, outside of Kubernetes,

07:24.960 --> 07:31.760
we have etcd implemented as a backend, and for single-node or local development, we also have

07:31.840 --> 07:41.120
a simple file-system backend as a discovery plane; not much more there. The communication plane

07:41.120 --> 07:48.960
is simply a TCP protocol to communicate between workers, between components, carrying for example

07:50.320 --> 07:57.120
the request, the prompt, the response, and so on, right? Also there's the frontend, which is, again,

07:58.080 --> 08:08.880
simply an OpenAI-compatible frontend; again, not much there. I will mention that most of the

08:08.880 --> 08:16.160
components, and the whole of Dynamo, are written in Rust, but the public API is maintained

08:16.160 --> 08:23.600
in Python, so for example the specific implementation of workers is done in Python,

08:23.680 --> 08:32.240
right, but everything underneath runs in Rust, for example for the frontend. So the first

08:32.240 --> 08:43.200
more sophisticated component would be the router, and basically the router's goal is to have as high

08:43.200 --> 08:47.200
a cache-hit rate as possible, right? Because, for example, you are chatting with your

08:47.200 --> 08:55.440
ChatGPT, and your first request landed on a specific node, and then you come back two minutes later,

08:55.440 --> 09:00.000
right, and you want to continue your conversation, so we want to route it

09:00.000 --> 09:07.680
preferably to the same node, which already has the KV cache from your previous conversation available,

09:07.680 --> 09:15.760
and not route somewhere else, where we would have to recompute or transfer the context from

09:15.760 --> 09:22.240
different nodes, right? So that's the goal of the router, and there are actually two main variables

09:22.240 --> 09:32.640
that the router acts upon. One is the KV indexer, which tracks basically which prefix is available

09:32.640 --> 09:39.680
at which node, and the other is the slot manager, which tracks how busy a worker is, right? Because maybe

09:39.680 --> 09:45.920
the cache is available but the worker is too busy to accept new work, right? So that's what the router does:

09:45.920 --> 09:51.760
it just combines those two variables with some weight, which is a tunable parameter,

09:52.960 --> 10:00.800
but you can also use round-robin or random routing, right? And it really works;
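
A toy sketch of that weighted decision (hypothetical names and scoring, not the real Dynamo router): combine the prefix-cache overlap reported by a KV indexer with the load reported by a slot manager, then pick the best worker.

```python
# Illustrative KV-aware routing: higher cache overlap is good, higher load
# is bad, and `weight` trades one off against the other.

def route(prefix_overlap, load, weight=0.5):
    # prefix_overlap: worker -> fraction of the request's prefix already cached
    # load: worker -> fraction of its slots that are busy
    scores = {
        worker: weight * prefix_overlap[worker] - (1 - weight) * load[worker]
        for worker in prefix_overlap
    }
    return max(scores, key=scores.get)

overlap = {"worker-a": 0.9, "worker-b": 0.0}
busy = {"worker-a": 0.3, "worker-b": 0.1}
print(route(overlap, busy))  # worker-a: its cache hit outweighs its higher load
```

If worker-a were fully saturated, the load term would flip the decision to worker-b, which is exactly the "cache is there but the worker is too busy" case from the talk.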

10:00.800 --> 10:08.080
we've heard from some customers that it enables a big improvement in time to first token,

10:08.080 --> 10:18.320
basically from the higher cache-hit rate. Then you have the disaggregation itself, so for

10:18.320 --> 10:24.080
disaggregation we use the NIXL library, which is part of our ecosystem, and NIXL is just a

10:24.080 --> 10:28.960
peer-to-peer communication library. You've probably heard about NCCL, which is the collective

10:28.960 --> 10:35.360
communication library from NVIDIA, and NIXL is designed to be a peer-to-peer communication

10:35.360 --> 10:41.760
library with a plugin architecture, so you can write your own plugins to support different backends, for example

10:41.760 --> 10:49.600
file systems. So we use NIXL to communicate GPU to GPU, but also from GPU to host memory,

10:49.600 --> 10:57.440
and from host to storage, local storage or shared network storage, right? So for disaggregation

10:57.520 --> 11:05.440
we mainly use it for GPU-to-GPU communication. So this is how our disaggregated flow looks:

11:05.440 --> 11:11.840
first the request goes to the router, the router will pick the best prefill worker to handle

11:11.840 --> 11:16.320
that request, the prefill worker will do its thing: it will compute the context,

11:17.280 --> 11:25.440
fill the KV cache, and then the router will forward this request to some decode worker,

11:25.440 --> 11:34.160
which will allocate KV blocks locally; it will read those KV blocks using NIXL from the

11:34.160 --> 11:41.120
prefill worker, and then, when it starts generating, it will also inform the prefill worker that it can free up the

11:41.120 --> 11:48.000
KV cache already. This is something, for example, we've implemented together with the Red Hat team

11:48.000 --> 11:54.640
in vLLM, so we added what's called the NIXL connector and some scheduling optimizations to vLLM,
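
A toy end-to-end sketch of the disaggregated flow just described (purely illustrative; in reality the block transfer goes over NIXL, and plain function calls stand in for it here):

```python
# router -> prefill worker (fills the KV cache) -> decode worker (pulls the
# KV blocks, starts generating, then tells prefill it can free its copy).

def prefill_worker(prompt_tokens):
    return {"kv_blocks": list(prompt_tokens), "freed": False}

def decode_worker(prefill_state, steps):
    local_blocks = list(prefill_state["kv_blocks"])  # "pull" via NIXL in reality
    prefill_state["freed"] = True                    # notify prefill to free up
    # Generate `steps` stand-in tokens from the transferred context.
    return [len(local_blocks) + i for i in range(steps)]

state = prefill_worker([10, 11, 12])
tokens = decode_worker(state, steps=2)
print(tokens, state["freed"])  # [3, 4] True
```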

11:55.440 --> 12:06.720
and we've done similar work with SGLang and TRT-LLM. Okay, so I mentioned NIXL; also, on top of NIXL,

12:06.720 --> 12:14.240
we've built something called the KV Block Manager (KVBM), and this is also a tool that can be plugged into

12:14.240 --> 12:22.560
vLLM or SGLang in the form of a KV connector, and its goal is to use the whole memory hierarchy

12:22.560 --> 12:29.520
to reduce the number of evictions you have to do in the KV cache, right? Because, for example,

12:29.520 --> 12:34.560
if you return in one minute, maybe the KV cache will still be there, but if you return to your

12:34.560 --> 12:39.120
chat in five minutes, maybe there were so many requests from other users in between

12:39.120 --> 12:43.760
that there was no space for your KV blocks, so they got thrown away, evicted from the KV cache,

12:44.560 --> 12:51.120
and now we have to recompute that, and yeah, that's costly, right? Especially for prefill,

12:51.680 --> 12:58.320
the attention cost is n squared in terms of the input size, so we want to, first,

12:59.440 --> 13:07.840
instead of evicting, move from device memory to host memory, then we want to move to local storage,

13:07.840 --> 13:15.680
and then probably to the network storage, right? If the context is long enough

13:15.680 --> 13:22.560
and the file system is fast enough, it still makes sense; sometimes it might take longer

13:22.560 --> 13:27.040
to read from the external storage, but you still save a ton on the compute, right?

13:28.400 --> 13:35.440
So that's what KVBM is designed to do: extend the life of your KV cache.
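
A hypothetical sketch of that offload hierarchy (tier names and logic are illustrative, not the KVBM implementation): instead of evicting a KV block outright, demote it one tier at a time, and only drop it once it falls off the last tier.

```python
# GPU -> host memory -> local storage -> network storage, then eviction.

TIERS = ["gpu", "host", "local_storage", "network_storage"]

def demote(placement, block):
    # placement: block id -> current tier
    i = TIERS.index(placement[block])
    if i + 1 < len(TIERS):
        placement[block] = TIERS[i + 1]  # offload one level down the hierarchy
    else:
        del placement[block]             # truly evicted only past the last tier

placement = {"blk0": "gpu"}
demote(placement, "blk0")
print(placement)  # {'blk0': 'host'}
```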

13:37.680 --> 13:43.280
And when you use that, you can again see an improvement in time to first token, because you avoid

13:43.360 --> 13:51.200
the recomputation, which is both costly and time-consuming. Okay, so then we have some

13:52.720 --> 14:01.360
tools only loosely connected with Dynamo that can easily be used outside it, like I mentioned before.

14:01.360 --> 14:08.000
One is the planner, right? So one of the challenges I mentioned is that the load on the system is

14:08.000 --> 14:14.640
changing over time, right? So we can assume there are some peaks, peak usage hours,

14:15.440 --> 14:23.440
and we want to scale prefill, decode and frontend workers up and down dynamically. So that's what the planner

14:23.440 --> 14:33.600
is doing. It works offline first: before the deployment starts, given the SLAs, it proposes a starting configuration,

14:34.480 --> 14:41.680
and for that we also have this tool called AIConfigurator, with which you can simulate what would

14:41.680 --> 14:49.040
be the best configuration without even using a GPU; just given the hardware and the model, we should be able

14:49.040 --> 14:59.680
to roughly estimate the best starting configuration. And then, at runtime, the planner will

15:00.400 --> 15:12.320
scale the decode and prefill workers up and down to satisfy the SLAs. Yeah, and for that, again,
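
As an aside, the kind of runtime decision the planner makes could be caricatured like this (purely illustrative, not the planner's real policy, which also sizes decode workers and frontends):

```python
# Compare observed time-to-first-token against the SLA and adjust replicas.

def plan_replicas(current, observed_ttft_ms, sla_ttft_ms, min_replicas=1):
    if observed_ttft_ms > sla_ttft_ms:
        return current + 1                     # SLA violated: scale out
    if observed_ttft_ms < 0.5 * sla_ttft_ms:
        return max(min_replicas, current - 1)  # lots of headroom: scale in
    return current                             # within budget: hold steady

print(plan_replicas(4, observed_ttft_ms=350, sla_ttft_ms=200))  # 5
```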

15:12.320 --> 15:23.760
we have a small Kubernetes enhancement called Grove, and that is used to scale

15:23.760 --> 15:34.400
multi-node deployments easily, right? Because if, for example, our inference engine is using

15:34.400 --> 15:42.880
four nodes, we need to scale that group of nodes together, and that's why we also developed

15:42.880 --> 15:50.000
Grove. So, for example, maybe you've heard of LeaderWorkerSet from Red Hat; this is something similar,

15:50.080 --> 16:00.800
maybe more tuned to NVIDIA hardware. So the end-user experience would be: they would create

16:00.800 --> 16:07.120
some configuration file where they specify that they want to use a frontend and prefill and decode workers,

16:07.120 --> 16:15.680
they give the Dynamo command where they can specify which model, which parallelization technique,

16:15.680 --> 16:21.680
and that will be parsed by the Dynamo operator, which will enhance it with some

16:21.680 --> 16:29.520
inference-engine-specific arguments, for example to handle setting up the master node,

16:30.480 --> 16:38.400
opening master ports, passing the master IP address to the workers and so on, and that

16:38.400 --> 16:44.320
will then be passed to the Grove operator, which will actually create all of the resources

16:44.320 --> 16:52.960
and then deploy them to your Kubernetes cluster. Yeah, then AIPerf is simply a small benchmarking

16:52.960 --> 16:59.680
library with which you can benchmark your OpenAI endpoint, and then ModelExpress, this is another

16:59.680 --> 17:07.840
very tiny library that we have in our ecosystem that's meant to help quickly load

17:07.920 --> 17:16.000
engines and also develop engines, right? So, yeah, and that's basically all the components,

17:16.640 --> 17:19.200
that's it, thank you.

17:24.880 --> 17:27.520
Any questions? We have a couple of minutes.

17:27.600 --> 17:40.720
Yeah, my question is, how much of it is LLM-specific versus more generic, to be used with

17:40.720 --> 17:45.840
diffusion models, VLMs? Is it LLM-specific?

17:48.480 --> 17:56.880
No, so one thing is that Dynamo is really not specific at all; there is the Dynamo runtime

17:57.120 --> 18:01.840
itself, the interface that you implement for components so that they get discovered by each other,

18:02.560 --> 18:08.800
and then on top of that is what we have already implemented, right? So, we've focused on LLMs,

18:08.800 --> 18:16.720
but also on multimodal models, and there's nothing in Dynamo itself that's specific to LLMs,

18:16.720 --> 18:17.840
if that makes sense.

18:17.840 --> 18:28.800
Any more questions? No? Last chance. Thank you. Thank you.

