WEBVTT

00:00.000 --> 00:11.520
Hello everyone. Welcome to our presentation on building cloud infrastructure for AI. I think

00:11.520 --> 00:15.720
everyone's quite familiar with what cloud infrastructure is. Today we'll be looking at it

00:15.720 --> 00:21.800
in the context of AI. By AI, I simply mean accelerated compute. For all intents and purposes

00:21.800 --> 00:29.960
today, it will be GPUs. Who am I? I'm Dave Hughes, software engineer, system admin, network

00:29.960 --> 00:38.840
person, done it all at some stage. Passionate technologist. I am an open source champion.

00:38.840 --> 00:43.080
I love open source. I've been a Debian maintainer at one point in my life. Contributed

00:43.080 --> 00:47.880
back to various user space projects, normally bug fixes when I've been packaging the user space

00:47.880 --> 00:55.400
software, whatever else. Primarily a problem solver: I like to solve problems.

00:55.400 --> 01:03.120
My background is originally in high-performance computing, parallel computing.

01:03.120 --> 01:08.680
That sort of thing. I got into GPU compute when it was really new and everybody was excited

01:08.680 --> 01:14.080
by CUDA and OpenCL. Got into rendering through that; been contributing to Blender for 11 years

01:14.080 --> 01:20.560
now. Through pure coincidence, through working on distributed rendering for that, I got into

01:20.560 --> 01:26.000
the whole cloud infrastructure area and I ended up really enjoying it. I stuck around

01:26.000 --> 01:34.040
and here I am. Found it really cool. I'm still very passionate about making things

01:34.040 --> 01:39.520
perform fast, about the whole performance aspect of things. I found I really enjoy thinking

01:39.520 --> 01:44.280
about and building the big picture, not just hacking away on one component of the stack

01:44.280 --> 01:52.400
for years at a time. We're going to see a lot of this big picture later on.

01:52.400 --> 01:57.840
First of all, what is a cloud? A simple way of looking at it; I want to skip through this. Again,

01:57.840 --> 02:04.680
abstraction of resources and infrastructure. Self-service for developer teams on demand. Elastic

02:04.680 --> 02:10.720
nature. I think everyone's quite aware of that. API-driven, multi-tenant. Kind of obvious,

02:10.720 --> 02:19.880
but not obvious. How does it factor into GPUs? Other accelerators will

02:19.880 --> 02:26.880
be common in the future, but for now it's GPUs used for accelerated compute. Typically,

02:26.880 --> 02:30.960
RDMA interconnects, if you're doing a big cluster, not always. It's quite common that

02:30.960 --> 02:37.120
you'll see cases where people will throw consumer grade GPUs into servers and do things

02:37.120 --> 02:43.200
like that with no RDMA interconnect. Fast storage. I do want to caveat though. It's not necessarily

02:43.200 --> 02:47.600
fastest. Everyone seems to get hung up on this topic where they think they need to be fastest.

02:47.600 --> 02:54.160
You just need an acceptable level. Not necessarily RDMA either. There's been a bit of a discussion

02:54.160 --> 03:00.080
on this. A company by the name of XTX released a new distributed file system recently, named

03:00.160 --> 03:06.240
TernFS, and they chose not to use RDMA in that either, which generated quite a bit of discussion.

03:07.760 --> 03:13.520
Dense, very big servers. Again, historically we'd think of a big, beefy storage server, but these days,

03:13.520 --> 03:19.520
it's very big, dense compute servers, pulling as much as, like, between 12 and 20 kilowatts per node.
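
To put those power figures in perspective, a quick back-of-the-envelope sketch (the rack budgets here are illustrative assumptions, not figures from the talk):

```python
def nodes_per_rack(rack_budget_kw: float, node_draw_kw: float) -> int:
    """How many compute nodes fit in a rack's power budget."""
    return int(rack_budget_kw // node_draw_kw)

# A legacy 8 kW rack cannot hold even one 12-20 kW dense GPU node,
# while a hypothetical 40 kW rack fits only a couple.
legacy = nodes_per_rack(8, 12)
modern = nodes_per_rack(40, 16)
```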

03:20.880 --> 03:25.680
Three rough scales of consumption in this model. A single GPU: again, just passing through the GPU

03:25.680 --> 03:31.920
to a virtual machine, or an entire node that could be 4 GPUs in the box, it could be 10,

03:31.920 --> 03:37.680
it could be 8; it could even use NVLink if you're using Nvidia. Or it could be a cluster,

03:37.680 --> 03:40.960
and that's what I mentioned earlier: you're dealing with cross-node interconnect,

03:41.520 --> 03:48.080
and basically tying thousands of these things together. GPU clouds: the good, the bad,

03:48.080 --> 03:53.600
and the ugly. So there are a few different ways these GPU clouds actually come together.

03:54.160 --> 03:57.280
First of all, you'll see many people sort of masquerading as a cloud,

03:57.280 --> 04:02.240
but what they really do is purchase assets, image the assets, and simply hand over the keys to the

04:02.240 --> 04:08.800
BMC, a you-break-it-you-buy-it type deal. Other people will throw OpenStack on there. It's quite an

04:08.800 --> 04:16.640
obvious sort of path to a quick MVP. It's rife with problems, I would say. Quite tightly coupled.

04:16.640 --> 04:21.920
I mean, in many cases we couldn't have high availability on control planes; you would restart

04:22.000 --> 04:27.440
a DHCP server and lose part of the network. It's nice in a lot of places, but the

04:27.440 --> 04:32.080
tightly coupled nature of many of the components makes it really difficult to use in production.

04:32.640 --> 04:36.720
Arguably it's designed for private cloud operation on your own premises or whatever,

04:36.720 --> 04:41.440
and people insist on using it as a public cloud for multi-tenancy, and it's not ideal.

04:42.160 --> 04:46.800
This is coming from experience of running it in production for like half a decade or more

04:46.800 --> 04:50.800
and trying to build a GPU cloud with it, not hacking around for six weeks and deciding it

04:50.800 --> 04:58.160
wasn't fit for purpose. Kubernetes, another one that's sort of used today, and people seem to think

04:58.160 --> 05:02.160
that they can throw a workload manager onto a cluster, and all of a sudden it's a cloud,

05:03.520 --> 05:08.960
arguably not. Out of the box, no strong isolation of tenants, due to the nature of using containers

05:08.960 --> 05:15.040
first as opposed to virtual machines on Linux, shared kernel and drivers, not exactly ideal when

05:15.040 --> 05:19.280
you might have people that want different driver versions for their compute instance and so on.

05:20.080 --> 05:24.320
Various other pain points in networking: many of the common networking options for Kubernetes,

05:24.320 --> 05:28.800
like Calico and everything else, are not quite what you want for cloud infrastructure out of the box.

05:31.920 --> 05:35.200
Then you get a few other options that normally come up in cloud infrastructure,

05:35.200 --> 05:39.360
and Nomad by HashiCorp was used quite a lot, again as a sort of cluster workload

05:39.360 --> 05:42.560
manager. Some people have successfully built things out of that.

05:43.200 --> 05:49.360
Triton was the cloud stack by Joyent, very popular back in its day, still being used today by MNX.

05:52.000 --> 05:57.360
How to design a cloud 101? Redundancy everywhere; again, kind of obvious, but not obvious.

05:58.320 --> 06:02.800
Scheduling maintenance windows doesn't really scale when you've got a thousand customers.

06:03.760 --> 06:09.280
A lot easier to just be able to take the control plane out of production and still have availability

06:09.280 --> 06:15.920
in the infrastructure. Avoid as many manual steps as possible during bootstrapping, network

06:15.920 --> 06:22.320
allocations, everything in there. Isolation, again quite obvious, most people think it's purely

06:22.320 --> 06:29.360
for security and privacy. It's actually for QoS as well as security. Standardization is quite key;

06:30.080 --> 06:34.560
again obvious, but not obvious; being able to move customers' compute instances

06:34.560 --> 06:42.000
from machine to machine will save you in production. Minimum state: we found this, again, the sort of

06:42.000 --> 06:47.840
MVP that was originally built for the GPU cloud that we had was stateful everywhere, and it quickly

06:47.840 --> 06:52.960
became a nightmare. We came to the conclusion that you keep the core infrastructure

06:52.960 --> 06:59.600
stateful and have everything else stateless. Part of that was down to the, not going to say data

06:59.600 --> 07:04.720
centers, but facilities we were operating in: former mining sites. Power could flap constantly,

07:04.720 --> 07:09.920
and if you've got power flapping on stateful infrastructure, it's a nightmare to bring it back up.

07:11.200 --> 07:14.960
Hardware selection for this: there are a few common paths to this.

07:16.640 --> 07:25.440
The standard path normally is OEM kit. Again, Dell, HPE, whatever; it does what it says on the box.

07:26.400 --> 07:30.480
I think about three or four years ago, it was quite difficult to come by most of this kit.

07:30.480 --> 07:35.600
Supermicro were one of the few OEMs actually selling HGX servers in any particular amount of

07:35.600 --> 07:42.640
number. Dell and everyone else are doing it these days. Again, they're almost identical; they all follow sort of

07:42.640 --> 07:48.720
reference designs from AMD and Nvidia, and so on. Quite overkill on specs. Most people don't actually

07:48.720 --> 07:53.840
need an 8-GPU HGX server. I think four-way are quite common these days as well, with the focus

07:53.840 --> 07:58.880
shifting to sort of inference in data centers with lower power density. Another important point on

07:58.880 --> 08:03.520
the overkill specs is that often they will have a single spec where they say, if you want this box,

08:03.520 --> 08:09.120
you also need to buy 60 terabytes of SSDs, you also need to buy 800 gigabit NICs, because that's just

08:09.120 --> 08:15.920
how we sell it. Most people don't need that. Next one: the standard approach. Again,

08:15.920 --> 08:22.480
mainboards, a PLX switch, and GPUs. This sort of approach originated in the mining and VFX rendering

08:22.880 --> 08:29.040
world. Again, a standard server with some GPUs. Not great from a density perspective, but if you're

08:29.040 --> 08:35.840
operating in legacy data centers where you've got 8 kilowatt racks: if you've got a standard server or

08:35.840 --> 08:40.080
something with a couple or three PCIe slots, it's easy just to drop in a card or two and actually

08:40.080 --> 08:46.480
try something out. OCP has obviously been proven to be king at scale, but you really need the

08:46.480 --> 08:52.880
scale to justify non-OEM kit. You're going to have to either sit down with ODMs in Taiwan,

08:52.880 --> 09:00.880
or base your designs on the Open Compute Project's hardware. Firmware: in an ideal world, you want

09:00.880 --> 09:07.600
full control of the firmware, both from a security perspective and for scalability. By open firmware,

09:07.600 --> 09:14.000
we typically mean coreboot and OpenBMC. It's really, really rare to find this in the wild,

09:14.000 --> 09:19.680
unfortunately. There's a lot of room for improvement here. I am really happy to say that

09:19.680 --> 09:24.880
there's active work going on from some of the vendors. Supermicro contributed quite heavily to

09:24.880 --> 09:29.840
getting Intel Sapphire and Emerald Rapids CPU support, or chipset support, into coreboot.

09:30.560 --> 09:36.800
I think the Supermicro nodes we had in production previously had a sort of work-in-progress

09:36.800 --> 09:42.160
coreboot port for that, but it was never finished. Seeing a lot more work being done there, though.

09:42.240 --> 09:46.560
From the open source community, we've obviously seen great work from 3mdeb and 9elements

09:46.560 --> 09:52.160
over the past five, six, seven, eight years, plenty of others in the community as well. We've got

09:52.160 --> 09:58.560
Redfish now as an IPMI solution. Also, fun fact: you find cases where Nvidia GPU

09:58.560 --> 10:04.000
trays have their own BMC, which is good fun when you're trying to control your infrastructure.

10:06.960 --> 10:11.440
Right, with that, now we have hardware, we have firmware, but we need something to run on it. We have

10:11.440 --> 10:15.760
a lot of workloads: the customer workloads, of course, from the people who are actually paying;

10:15.760 --> 10:20.480
you have internal workloads, all of your services that you need to run the customer workloads and

10:20.480 --> 10:25.040
supporting workloads, things like monitoring, logging, the bread and butter stuff that everybody does.

10:26.000 --> 10:31.600
We decided to go with Kubernetes here and not out of any deep love for the technology, but just

10:31.600 --> 10:36.320
it does the job, it's widely known, it's widely supported, whatever, pick your battles, that's not

10:36.320 --> 10:41.040
something to spend a lot of time on when there's something that works. I need to stress, we use this

10:41.040 --> 10:45.360
purely internally; it's not like the case we mentioned earlier where people just

10:45.360 --> 10:49.520
let the customers run pods in their Kubernetes. No, this is purely internal; all

10:49.520 --> 10:53.280
customer workloads are behind the virtualization barrier, just for security reasons.

10:55.120 --> 10:58.720
The nice thing about Kubernetes is that you can also use it as a platform for managing your

10:58.720 --> 11:04.160
own resources, so-called CRDs. This is quite common for projects that build on top of

11:04.160 --> 11:09.760
Kubernetes, things like KubeVirt or Rook or whatever, but you can also use it for your own application.

11:10.800 --> 11:14.720
Quite convenient: all of our internal state, things like overlay networks, instances, IP

11:14.720 --> 11:18.720
allocations, all of it is modeled as a custom resource in Kubernetes, and that gives you

11:19.520 --> 11:23.920
automatic API implementation: things like patching endpoints,

11:26.240 --> 11:33.440
resource validation, watch streams, libraries, text UIs, all of that for free, which is

11:33.520 --> 11:38.880
quite convenient, and then the rest of the stack, all of the internal services that I mentioned

11:38.880 --> 11:43.760
are basically built as controllers, operating on these resources. In our case nowadays, mostly written

11:43.760 --> 11:48.880
in Rust, for similar reasons as Kubernetes: not out of any particular deep love, but because a lot

11:48.880 --> 11:52.480
of people on the team knew it, and the others weren't opposed to it, so it's what we use now.
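
As a sketch of what modeling internal state as a Kubernetes custom resource can look like, here is a hypothetical overlay-network object and the kind of check a CRD's validation schema would normally enforce (the API group, kind, and fields are invented for illustration, not the speakers' actual schema):

```python
# Hypothetical custom resource for a tenant overlay network, shaped the way
# any Kubernetes object is (apiVersion / kind / metadata / spec).
overlay = {
    "apiVersion": "cloud.example.com/v1alpha1",  # hypothetical API group
    "kind": "OverlayNetwork",
    "metadata": {"name": "tenant-1234"},
    "spec": {"vni": 10042, "cidr": "10.64.0.0/20", "dhcp": True},
}

def validate(resource: dict) -> bool:
    """Minimal validation that a CRD's OpenAPI schema would enforce server-side."""
    spec = resource.get("spec", {})
    return (
        resource.get("kind") == "OverlayNetwork"
        and isinstance(spec.get("vni"), int)
        and 1 <= spec["vni"] <= 16_777_215  # VNI is a 24-bit VXLAN identifier
    )
```

In a real deployment this schema would live in a CustomResourceDefinition, and the API server would do the validation, watch streams, and patching for you.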

11:54.240 --> 12:00.960
Orchestration is nice, but you need an OS on the host to run it on. For that, the traditional model is

12:00.960 --> 12:06.320
you put the server in a data center, install an OS on it, maybe have a network boot

12:06.320 --> 12:13.200
based solution that images the servers. The problem with that, and we did this in earlier

12:13.200 --> 12:16.720
iterations, is that over time you end up with a lot of state drift. You're like, oh, we can't

12:16.720 --> 12:20.720
reboot this machine, there's an important customer on there. So at some point the image is two years out of

12:20.720 --> 12:25.760
date, and then you patch something with Ansible and all of the machines are different, and it's not great.

12:25.760 --> 12:30.320
So solution here is, again, as mentioned earlier, minimize state wherever possible,

12:30.400 --> 12:35.200
in our case this means that the servers are completely ephemeral. They boot over the network,

12:35.200 --> 12:39.680
they load the OS into RAM and just run from that, and when you reboot them, they just pull

12:39.680 --> 12:45.680
whatever is the latest image, and there's no stale data on them. Some state is required, of course;

12:45.680 --> 12:51.120
in practice the Kubernetes control plane needs to persist things across reboots, and Ceph for storage,

12:51.120 --> 12:56.480
of course, also needs to be persistent, so that just gets mounted in after the image boots,

12:56.480 --> 13:03.040
but everything else is stateless. For the images, we build them with mkosi, or however you

13:03.040 --> 13:07.680
say it; it's part of the systemd project, it's quite a new tool. We use it to build Debian

13:07.680 --> 13:13.200
base images, nothing particularly fancy there. The only interesting part is that we don't have a

13:13.200 --> 13:17.440
classic rootfs for network booting; we just pack the entire distribution into the

13:17.440 --> 13:22.320
initrd, because, well, it's running from RAM anyways, you might as well, and you skip the

13:22.320 --> 13:26.240
entire part where you now also need to configure your initrd to bring up the network to

13:26.240 --> 13:32.080
pull the rootfs. Just a nice simplification. Speaking of network boot, this is roughly

13:32.080 --> 13:38.160
the flow we have there. So, traditionally, you would have a PXE ROM, which

13:38.160 --> 13:42.720
chainloads into iPXE, which fetches a config, which fetches a kernel and an initrd,

13:42.720 --> 13:47.120
which boots, which brings up the network again, which fetches the rootfs, which then boots.

13:47.120 --> 13:50.480
A lot of steps here. As already mentioned, we're cutting out the rootfs; another thing we're

13:50.480 --> 13:56.160
cutting out is the whole PXE thing. We use UEFI HTTP booting, which tends to be quite well supported

13:56.400 --> 14:03.520
by firmware nowadays. So, basically, the firmware of the box just brings up the network, does an HTTP request,

14:03.520 --> 14:09.040
fetches a UEFI binary, and boots that. You can pack Linux and an initrd into one UEFI binary nowadays.
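
A minimal sketch of the decision such a netboot service has to make: map a booting machine to the UEFI binary (kernel plus initrd packed into one image) it should receive. The MAC-based inventory and the file names here are assumptions for illustration:

```python
# Hypothetical inventory mapping booting machines (by MAC) to unified
# kernel images, with a default image for unknown hosts.
IMAGES = {
    "aa:bb:cc:dd:ee:01": "worker-2024-06.efi",
    "aa:bb:cc:dd:ee:02": "storage-2024-06.efi",
}

def select_image(mac: str, default: str = "generic.efi") -> str:
    """Pick the UEFI binary a host should boot, normalizing the MAC first."""
    return IMAGES.get(mac.lower().replace("-", ":"), default)
```

An HTTP boot service would call something like this per request and stream the chosen binary back to the firmware.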

14:10.960 --> 14:16.480
One more thing we do here is config for the host. So, when the system boots, it sends a request

14:16.480 --> 14:21.280
to this netboot service that I mentioned here; this looks into NetBox, fetches configuration

14:21.280 --> 14:26.560
for the server, packs it into an initrd, appends it to the image, packs it into a UEFI

14:26.560 --> 14:35.600
image, streams that out; the server boots it, goes into Linux, and then its pid 1 is a quick utility

14:35.600 --> 14:39.840
that renders out templates in the image with the config values that come from NetBox,

14:39.840 --> 14:43.920
and then it hands over to systemd, and that brings up the system normally. Sounds quite hacky,

14:43.920 --> 14:48.240
but works very well in practice, and we don't need to worry about being persistent across

14:48.240 --> 14:52.960
reboots, because everything is ephemeral anyways. So, with that, we come to networking,

14:52.960 --> 14:58.560
which is arguably the core of a cloud operation, because there's a lot of features that essentially

14:58.560 --> 15:03.680
you want to build on top of the network. So, there's a lot of thought you want to put into getting

15:03.680 --> 15:08.880
that right. Scalability and flexibility are really important here. So, what we do in practice is we

15:08.880 --> 15:13.440
have a strict separation between the underlay network, whose only job is to connect the servers,

15:13.520 --> 15:18.880
and overlays, which then connect the workloads on top. The underlay, as I said, its only job is that

15:18.880 --> 15:22.880
the servers can talk to each other. You can do it very basic: just connect them to a switch, and

15:22.880 --> 15:27.600
DHCP, whatever, done. If you have a big operation, you want something

15:28.640 --> 15:32.800
more powerful, something that we found there, which works really well, is routing to the host,

15:32.800 --> 15:37.280
every host has a BGP daemon on it, has a loopback IP, peers with the switches, and announces

15:37.280 --> 15:41.200
that. That way you can build whatever network topology you want; you can do things like having

15:41.200 --> 15:46.240
two network cards uplinked to separate switches without having to screw around with MLAG or whatever;

15:47.120 --> 15:54.480
everything just works. On the switches, we like SONiC, which is a switch OS that's open source,

15:54.480 --> 15:59.920
based on Linux, has some problems in practice, but you can make it work, and when you make it work,

15:59.920 --> 16:03.200
it's really nice, because then you can do whatever you want, which is fully programmable,

16:03.200 --> 16:10.640
which is very powerful. For the overlay, we use VXLAN EVPN, simply because, again, it's standard; everybody supports

16:10.720 --> 16:16.560
it, hardware supports it, no reason not to. The underlay logic is isolated into its own network

16:16.560 --> 16:20.320
namespace, so this is roughly what the host looks like, the actual network cards are here,

16:20.320 --> 16:24.480
in the network namespace there's a BGP speaker, and then we have the VXLAN

16:25.520 --> 16:32.160
interfaces and bridges here. There we have one Kubernetes overlay, which then is hooked up into the

16:32.160 --> 16:38.480
main host init namespace, sort of, where Kubernetes and so on lives; so our cluster network

16:38.560 --> 16:44.000
and everything just goes through this overlay. Then we have a public overlay, more on that on the next slide,

16:44.000 --> 16:48.000
which is used to connect workloads to the internet, and then of course we have tenant overlays,

16:48.000 --> 16:52.880
which are just L2 plus DHCP for our customers, to connect customer instances together,

16:52.880 --> 16:59.920
kind of like a private network. This public overlay is, sort of, arguably an SDN,

16:59.920 --> 17:05.440
but very basic and just built in-house to do what we need to do, again, to have flexibility there

17:05.440 --> 17:10.640
for offering various features. The logic here is: there are endpoints in this public overlay,

17:10.640 --> 17:15.680
things like instances, load balancers, services like object storage, and of course the edge routers,

17:15.680 --> 17:20.480
which act as gateways to the internet; they also sit in this overlay. They announce reachability

17:20.480 --> 17:25.280
information about themselves and about the routes or prefixes that they advertise. Again,

17:25.280 --> 17:30.320
this goes into Kubernetes resources, one resource per endpoint, so it looks like this, for example,

17:30.320 --> 17:34.720
so then the endpoint says: this is my IP in this overlay, this is my MAC address, and

17:34.720 --> 17:38.960
these routes, please send them to me. If you're curious about this port thing, next slide.
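
To illustrate, a hypothetical endpoint record and the iproute2 commands a host daemon might derive from it, based on the announced IP, MAC, and routes (the field names and exact commands are assumptions for illustration):

```python
# A hypothetical endpoint in the public overlay: its overlay IP, its MAC,
# and extra routes it wants delivered to it.
endpoint = {
    "ip": "2001:db8:100::42",
    "mac": "52:54:00:12:34:56",
    "routes": ["2001:db8:200::/64"],
}

def render_commands(ep: dict, dev: str = "br-public") -> list[str]:
    """Emit iproute2 commands: a static neighbor entry, plus routes via the
    endpoint for the prefixes it advertises."""
    cmds = [f"ip neigh replace {ep['ip']} lladdr {ep['mac']} dev {dev} nud permanent"]
    cmds += [f"ip route replace {r} via {ep['ip']} dev {dev}" for r in ep["routes"]]
    return cmds
```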

17:40.160 --> 17:45.600
This is a public IPv6 space, again, just hooked up to the internet, but you can also route IPv4

17:45.600 --> 17:50.080
traffic via an IPv6 next hop, of course; that just simplifies things, to not need to do as

17:50.080 --> 17:55.760
much addressing there, because IPv4 addresses are quite scarce nowadays. And then we just have a network

17:55.760 --> 18:01.280
daemon on each host, which watches all of these resources and then sets up the networking state

18:01.360 --> 18:05.360
for the endpoints accordingly. So we pre-populate the neighbor table based on the

18:05.360 --> 18:09.600
MAC information that's there, to avoid neighbor discovery flooding in the VXLAN;

18:09.600 --> 18:14.800
we populate the route table, potentially with multi-hop based on what each node advertises,

18:15.600 --> 18:20.400
and that ends up working quite well there. Of course, we want to hook it up to the internet

18:20.400 --> 18:24.000
eventually, so then we also have one component that takes all of this and translates it into

18:24.000 --> 18:28.720
BGP announcements, which then can go out to some hardware router that's connected to the internet.

18:28.720 --> 18:33.200
The reason why we don't just use BGP for everything in here is, again, flexibility,

18:33.200 --> 18:39.200
things like announcing specific ports, you can't really do there. We have a stateless layer 4 load

18:39.200 --> 18:44.560
balancer in there: custom control plane, and the data plane is basic Linux IPVS right now. We're looking

18:44.560 --> 18:53.520
at implementing something with the second-chance routing that GitHub proposed in their

18:53.520 --> 18:57.760
GLB load balancer; Cloudflare also uses it, and there are very good blog posts about that.
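
The core idea behind a stateless layer 4 design is that every node can independently hash a connection to the same backend, with no shared connection table. A toy sketch using rendezvous hashing (a generic illustration, not GitHub's actual GLB algorithm):

```python
import hashlib

def pick_backend(five_tuple: tuple, backends: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: every load balancer node
    computes the same winner for a given connection, so the data plane can
    stay stateless."""
    def score(backend: str) -> int:
        h = hashlib.sha256(repr((five_tuple, backend)).encode())
        return int.from_bytes(h.digest()[:8], "big")
    return max(backends, key=score)

conn = ("198.51.100.7", 40312, "203.0.113.1", 443, "tcp")
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
```

The "second chance" idea adds a fallback backend to each packet so that connections survive backend-set changes; the hashing above is only the first half of that story.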

18:59.120 --> 19:03.040
For traffic offloading, to get more throughput: I mean, all of these servers have

19:03.040 --> 19:06.960
400-gigabit NICs now, and doing that with Linux kernel networking is kind of challenging.

19:07.360 --> 19:12.400
One thing we did there in the past, and are now looking at again, is using VPP for this: you bind

19:12.400 --> 19:18.560
the physical network cards to VPP, modify our network daemon to push all this data in there,

19:18.560 --> 19:23.200
and then you use tap interfaces to connect traffic back to container applications,

19:23.200 --> 19:29.520
like Kubernetes, and VMs that are directly hooked up with vhost-user. Of course,

19:29.520 --> 19:34.400
nowadays, the very modern thing is to use a DPU and offload everything there, but we're not quite

19:34.400 --> 19:40.400
there yet. Networking for instances: again, each instance gets a public IPv6 subnet, a

19:40.400 --> 19:47.120
delegated prefix. Instance traffic to the internet: just a regular firewall; netfilter conntrack

19:47.120 --> 19:53.040
works well. We run this per instance to avoid, again, having a central point of

19:53.040 --> 19:58.480
failure, because this needs to keep track of connection state. A public IPv4 address is optional,

19:59.200 --> 20:03.920
because they're scarce; alternatively you just use NAT. There's some trade-offs there, because it

20:04.800 --> 20:09.440
needs to be decentralized, again, to be scalable, but if you get abuse reports,

20:09.440 --> 20:14.240
you need to be able to map back to the customer. So what we do there is we slice up public IPs

20:14.240 --> 20:20.160
based on port number: we assign one port range to each instance, and then we just

20:20.160 --> 20:27.120
make the instance NAT only to that range. The RDMA network, again, is important for GPU clusters.
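
The port-range slicing just described can be sketched like this (the base offset and range size are illustrative assumptions):

```python
def port_range(slot: int, base: int = 1024, size: int = 2048) -> tuple[int, int]:
    """Port range on a shared public IPv4 for the instance in the given slot.
    With these numbers, one address serves (65536 - 1024) // 2048 = 31 instances,
    and any logged source port maps back to exactly one instance."""
    lo = base + slot * size
    hi = lo + size - 1
    if hi > 65535:
        raise ValueError("slot exceeds the address's port space")
    return lo, hi

def slot_for_port(port: int, base: int = 1024, size: int = 2048) -> int:
    """Reverse lookup for abuse handling: which slot a seen source port belongs to."""
    return (port - base) // size
```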

20:27.120 --> 20:32.000
The classic number there is around 3.6 terabit per node. Obviously, it needs to be hardware

20:32.000 --> 20:36.240
accelerated. Luckily, you usually have one network card per GPU, so you don't need to slice these up.

20:37.280 --> 20:41.200
For InfiniBand: the InfiniBand fabric is one RDMA network, and you partition it with

20:41.200 --> 20:45.520
something called P-Keys; it's kind of like a VLAN. Sorry to skip over some things here for

20:45.600 --> 20:51.840
time reasons. RoCE goes over Ethernet, it's kind of similar: you can do basic VLAN isolation

20:51.840 --> 20:57.120
on the network card, but it's better to do it on the switch. Again, SONiC is nice for that. For storage,

20:57.120 --> 21:01.360
you want something that's fast and reliable; unfortunately, that's not always quite feasible. So in practice,

21:01.360 --> 21:06.400
we split it. Fast storage is just local NVMe on the server. Again, it's

21:06.400 --> 21:11.280
ephemeral, because we don't want to guarantee anything about state, so this is used for caching,

21:11.360 --> 21:15.520
and then we have network storage, which needs to be robust. The local storage, it's easy to

21:15.520 --> 21:20.320
bottleneck on that. If you have four Gen 4 NVMe drives and you want to slice them up for customers,

21:20.320 --> 21:25.360
and you want to encrypt it, you'll hit the bottleneck. So we use SPDK in practice there. It's, again,

21:25.360 --> 21:31.840
like VPP, a userspace offload, but for storage, and we get 98% of the raw throughput,

21:31.840 --> 21:35.920
even though we do crypto and LVM and RAID and pass-through to the VM. So that's quite nice.

21:36.880 --> 21:42.080
For the network storage, we use Ceph. That does everything: file systems exposed via

21:42.080 --> 21:47.520
virtiofs, block storage through SPDK, object storage, raw RBD, and the Kubernetes cluster

21:47.520 --> 21:53.440
on it via Rook. And then virtualization at the end. The obvious option at first glance

21:53.440 --> 21:58.640
is KubeVirt; we looked at that. But the thing is, last time we checked at least, it's designed to integrate

21:58.640 --> 22:03.440
legacy applications into your Kubernetes cluster; we want the exact opposite. We want it to be isolated,

22:03.600 --> 22:07.440
and with all of the special things we do, hacking that into Kubernetes, at some point it's easier

22:07.440 --> 22:11.760
to just say, you know what, we just run our own QEMU in a container. That does the job.

22:12.640 --> 22:16.320
So this is what a typical 8-GPU server looks like. You need to

22:17.440 --> 22:21.280
mind the NUMA topology there: you want to select GPUs that are close together,

22:21.280 --> 22:26.400
then you want to select CPU cores on the same NUMA node. Then you want to pin the vCPUs, both for

22:26.400 --> 22:31.120
performance and for security, again, side-channel attacks nowadays, everybody's aware of that.

22:32.000 --> 22:35.840
And you want to pass this topology info to the guest, so it can schedule accordingly.
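
A toy sketch of that placement logic, with a hard-coded hypothetical topology standing in for what you would really read from sysfs or hwloc:

```python
# Hypothetical 2-socket box: four GPUs and a set of CPU cores per NUMA node.
TOPOLOGY = {
    0: {"gpus": ["gpu0", "gpu1", "gpu2", "gpu3"], "cores": list(range(0, 32))},
    1: {"gpus": ["gpu4", "gpu5", "gpu6", "gpu7"], "cores": list(range(32, 64))},
}

def place(num_gpus: int, cores_per_gpu: int = 8):
    """Pick GPUs from a single NUMA node and pin vCPUs to cores on that same
    node, so guest traffic never has to cross the inter-socket link."""
    for node, res in TOPOLOGY.items():
        if len(res["gpus"]) >= num_gpus:
            gpus = res["gpus"][:num_gpus]
            cores = res["cores"][: num_gpus * cores_per_gpu]
            return node, gpus, cores
    raise RuntimeError("no single NUMA node can satisfy the request")
```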

22:37.280 --> 22:42.640
Then, for the last point, GPU-specific things about virtualization. You need to be mindful

22:42.640 --> 22:47.760
of the PCIe topology here. There's a lot of GPU-to-GPU and GPU-to-network-card traffic,

22:47.760 --> 22:52.800
which is PCIe peer-to-peer. If that goes through your CPU root complex, you're going to

22:52.800 --> 22:58.000
bottleneck hard because there's not enough lanes there. Workloads need to know the topology,

22:58.000 --> 23:02.400
so that they don't accidentally send traffic across the entire node because they think it's

23:02.400 --> 23:08.160
local. You can configure this into the instance manually, but it's easier to just model it through

23:08.160 --> 23:14.800
QEMU, by having virtualized PCIe switches in the topology. Classic pitfall there: if you just

23:14.800 --> 23:18.880
pass through these devices, you have the IOMMU in the data path, and that forces all the traffic

23:18.880 --> 23:24.480
to your CPU. So instead of your 2.6 terabytes, you're suddenly getting 180 gigabytes, not very nice.
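
To expose such a topology to the guest, you can build a virtual PCIe switch in QEMU and hang the passed-through devices off it. A sketch of the arguments (device IDs and host addresses here are hypothetical; `pcie-root-port`, `x3130-upstream`, `xio3130-downstream`, and `vfio-pci` are standard QEMU device types):

```python
# Assemble QEMU arguments for a virtual PCIe switch with two downstream
# ports, each carrying one passed-through device (a GPU and its NIC).
def pcie_switch_args(gpu_bdf: str, nic_bdf: str) -> list[str]:
    return [
        "-device", "pcie-root-port,id=rp0,chassis=1",        # root port on pcie.0
        "-device", "x3130-upstream,id=up0,bus=rp0",          # switch upstream port
        "-device", "xio3130-downstream,id=dn0,bus=up0,chassis=2",
        "-device", "xio3130-downstream,id=dn1,bus=up0,chassis=3",
        "-device", f"vfio-pci,host={gpu_bdf},bus=dn0",       # GPU behind the switch
        "-device", f"vfio-pci,host={nic_bdf},bus=dn1",       # its RDMA NIC next to it
    ]
```

The guest then sees the GPU and its NIC sharing a switch, and topology-aware workloads route peer-to-peer traffic accordingly.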

23:24.560 --> 23:29.200
The way to avoid that: there's a feature called Address Translation Services that you

23:29.200 --> 23:33.600
enable on the network card, and then you enable direct traffic with Access Control Services

23:33.600 --> 23:39.280
on the PCIe switch. You just Google that, find the commands for that, and then you get your

23:39.280 --> 23:43.040
full throughput. In that case, you need to trust the network card firmware to not do malicious

23:43.040 --> 23:47.680
DMA attacks on your host. Well, this is just something you have to do in this case, to be honest.

23:48.400 --> 23:53.600
Finally, the last point: you want your GPU interconnect, something like NVLink,

23:53.600 --> 23:59.280
or UALink for a more general solution. If you do GPUs one at a time, it's really easy.

23:59.280 --> 24:03.280
You don't have GPU interconnect? Just pass through the GPU, you're done. If you pass all of

24:03.280 --> 24:07.680
the GPUs through, it's also easy: you just pass everything through and let the guest manage it.

24:08.880 --> 24:13.600
If you're mixing it, it's difficult because now you need to configure which GPUs are allowed to

24:13.600 --> 24:18.480
talk to which. Very, very vendor-specific right now, unfortunately. I hope that this is

24:18.480 --> 24:23.840
an area that will get more support and standardization in the future.

24:25.840 --> 24:29.760
With that, we're at the end of the quick tour. Unfortunately, we couldn't go into a lot of detail on

24:29.760 --> 24:34.400
many things, just because, again, big picture, small time slot. But if you have any questions,

24:34.400 --> 24:39.200
any follow-ups, feel free to talk to any of us at the event, send us an email. If you want to

24:39.200 --> 24:41.920
learn more about any of this, let us know and we can do another talk next year.

24:41.920 --> 24:46.240
Thank you very much everyone.

