WEBVTT

00:00.000 --> 00:10.840
Alright, hello everyone. Thanks for coming. My name is David Branstill and this is

00:10.840 --> 00:20.640
the talk on why Android builds are so damn slow. So just a little bit about me. I spent

00:20.640 --> 00:25.760
about 10 years working at Google on the Android operating system. I was an engineer on the Android

00:25.760 --> 00:32.240
runtime team working on optimizations in the Java stack and Android security, and also

00:32.240 --> 00:38.160
the technical lead for the hypervisor and virtualization framework projects. And now

00:38.160 --> 00:44.680
we work with device manufacturers who use Android in their products, and we try

00:44.680 --> 00:51.360
to optimize their workflows so they're more efficient. And the number one complaint we keep

00:51.360 --> 01:00.080
hearing from the ecosystem is that Android builds and checkouts are just crazy slow. Just

01:00.080 --> 01:06.160
to give you some context. I mean you just heard from Stefan how huge the Android codebase

01:06.160 --> 01:13.280
is. Checking it out (downloading the source code to your computer) easily takes between 15 and 20

01:13.280 --> 01:19.120
minutes. Most of this is just because it's so damn huge. But the other factor is that the

01:19.120 --> 01:25.600
AOSP mirror can be quite slow, and a good pro tip is to use authenticated access, because

01:25.600 --> 01:30.480
that's going to get around some of the usage quota. Once you have the source code you want to

01:30.480 --> 01:38.560
build it. That typically takes around two hours, or roughly four and a half dollars, depending

01:38.560 --> 01:45.280
on where you get your machines from and again this comes down to the fact that it's so damn

01:45.280 --> 01:52.640
huge. There are about 250,000 build steps that the machine needs to go through, and obviously if you

01:52.640 --> 01:56.800
spend more on your hardware it's going to go faster but the costs are also going to go up so

01:56.800 --> 02:03.440
there's a trade-off that you have to balance. And then as you develop you're going to go through

02:03.440 --> 02:11.680
incremental builds and those can take 10 minutes, 30 minutes, it all depends on what you've

02:11.680 --> 02:16.720
changed and how much of the codebase needs to be rebuilt at that point. But there's also another

02:16.720 --> 02:23.680
problem with Android, and that's the fact that it doesn't have good enforcement of dependency

02:23.680 --> 02:29.920
declarations. So sometimes you end up changing something and it doesn't get correctly rebuilt and

02:29.920 --> 02:35.520
it uses an older version of the build artifact. So at that point most engineers just wipe the

02:35.520 --> 02:39.920
build directory and start from scratch and it's another two hours to get back to a working state.

02:40.240 --> 02:48.640
I'm going to be quoting some numbers here, and just to make them reproducible, I picked

02:48.640 --> 02:56.080
a virtual machine on Google Cloud, with 120 gigs of RAM and an NVMe SSD, that costs about $2

02:56.080 --> 03:04.480
an hour. All of the numbers are for the latest Android 16 QPR2 release, and it's building the

03:04.480 --> 03:11.120
AOSP Cuttlefish ARM64 phone target. So if you want to try it... sorry.

03:12.960 --> 03:20.240
Now, okay, I'll speak louder. So if you want to reproduce these numbers, you should be able to do so

03:20.240 --> 03:29.200
by creating a similar environment. So historically the answer to how are we going to fix this

03:29.280 --> 03:36.720
has been that Android will eventually move to Bazel. Bazel's the build system that's used in

03:36.720 --> 03:41.200
the Google monorepo. It has been open sourced. It has all these great properties of

03:41.920 --> 03:48.320
hermeticity and reproducibility. It has build and test caching. It scales really well.

03:49.680 --> 03:58.480
So the original plan from AOSP has been to gradually migrate from the old make files to Bazel.

03:59.280 --> 04:08.240
In a bunch of steps. Step one: develop a new build system, now called Soong, which uses a Bazel-like file format,

04:09.120 --> 04:16.160
but is backwards compatible with make. Step two: gradually convert all the makefiles to these

04:16.160 --> 04:23.120
new BP files. You saw that in Stefan's chart as well. Step three: start gradually building some

04:23.200 --> 04:29.600
parts of the system with Bazel, gradually phasing out Soong and the original make.

04:30.320 --> 04:36.000
Step four: profit. You get all the benefits of Bazel and the associated infrastructure.

04:36.960 --> 04:45.200
The effort culminated around Android 14, in early 2023, when most of the AOSP codebase had been migrated

04:45.280 --> 04:53.360
to Soong, give or take some device config files. But adoption downstream has remained

04:54.240 --> 04:57.920
a lot lower, especially in BSPs, where you still see a lot of makefiles.

04:59.360 --> 05:07.440
But in early 2023, we started seeing Bazel builds on ci.android.com. So some parts of the system

05:07.520 --> 05:15.120
started to be built with Bazel. It all looked great. And then, in typical Google fashion,

05:15.680 --> 05:22.960
we got a notification that Bazel isn't a supported build system anymore. The migration has been

05:22.960 --> 05:32.480
halted in late 2023, and outside of building the Android kernel, Soong is the main build system from now on.

05:32.640 --> 05:41.600
So that kind of leaves us in a bind. We're in a state where the migration hasn't been

05:41.600 --> 05:47.280
completed. We have Soong, but Soong isn't Bazel. It doesn't have those great properties of

05:47.920 --> 05:52.800
hermeticity and reproducibility. It sort of tries: there are some mechanisms to

05:52.800 --> 05:59.520
sanitize the PATH, force certain toolchains to be used, and there's a sandbox for genrules. But

05:59.520 --> 06:03.520
under the hood, it's still compatible with make. And you can do anything with a makefile; you can

06:03.520 --> 06:13.280
bypass all these great sandboxing mechanisms. So the reason why Bazel builds are fast is that

06:13.280 --> 06:18.560
it builds on these properties and uses them in a really clever way. But because Soong doesn't have

06:18.560 --> 06:24.720
those properties, you can't really do that. So in Bazel, all inputs and outputs

06:24.800 --> 06:30.320
are declared upfront; you don't have that with Soong. It's easy to sandbox tasks, enforce

06:30.320 --> 06:36.000
dependencies, and capture results; you can't do that with Soong. It's easy to offload

06:36.000 --> 06:40.320
computation to another machine. If you have a build farm, especially in a large organization that

06:40.320 --> 06:46.240
can be really helpful; you can't do that with Soong. So the question is, given this state,

06:46.960 --> 06:54.560
what can we actually do? How much of these sorts of features can we get with the build system that

06:54.560 --> 07:00.080
Android currently uses, and will probably use for the foreseeable future? I'm also going to

07:00.080 --> 07:06.800
reference a talk from Chris Simmonds at the Embedded Open Source Summit 2023, where he deep-dives into

07:07.920 --> 07:12.880
how Soong and all the other components work. I'm not going to go through that here. I'm just

07:12.880 --> 07:18.000
going to tell you the important bits. But if you're interested, go and check out that video. But in short,

07:18.320 --> 07:24.560
the way the build system works is that we have two kinds of build files. Android.bp files,

07:24.560 --> 07:32.080
which look very similar to Bazel's Starlark language; these are just declarative templates,

07:32.080 --> 07:38.240
and all the build logic actually lives in Go. And then we have the legacy Android.mk files

07:39.680 --> 07:44.960
that are still used in a bunch of places. And the way that the build works is that you have

07:45.040 --> 07:52.240
four main components. soong_ui is the top-level process; it drives the whole thing and also renders

07:52.240 --> 07:58.880
the progress bar. Then you have soong_build, which takes the blueprint files and

07:58.880 --> 08:05.200
compiles them down to Ninja, and Kati, which does the same thing for makefiles. And that's how the

08:05.200 --> 08:12.320
two systems can coexist. They both compile down to the same low level format and they can reference

08:12.400 --> 08:20.240
each other if necessary. And this process, if you've ever built Android, is really annoying because

08:20.240 --> 08:26.160
depending on how fast your SSD is, this can take anywhere between two and seven minutes. And every

08:26.160 --> 08:31.120
time you touch one of these files, you have to go through the process again. The reason why it's taking

08:31.120 --> 08:36.960
so long is because it loads everything into RAM. There's a lot of analysis. It uses about

08:37.040 --> 08:44.640
40 gigs of RAM and then writes about seven gigabytes of data on disk. So if you're changing

08:44.640 --> 08:51.360
the config files, the build files often, you're going to be running into this over and over. But once

08:51.360 --> 08:59.520
this is done, you get a set of .ninja build files, you execute Ninja, and it then just goes through

09:00.160 --> 09:07.520
roughly 250,000 individual build commands, executing them with as much parallelism as it can get.

09:07.520 --> 09:19.520
That's the bulk of the build. And if you're ever trying to debug why your builds are slow,

09:19.520 --> 09:25.680
one indispensable tool is looking at the build traces. This is a file that's generated during the

09:25.680 --> 09:34.400
build; it's in out/build.trace.gz. And you can take this file and open it in any

09:34.400 --> 09:40.640
perf viewer. The one I like is Perfetto (ui.perfetto.dev). You can just upload the file there and it gives you this

09:40.640 --> 09:47.680
beautiful UI where you can see everything that's happening. And we can see those stages here.
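
As a command-line alternative, you can also pull the slowest steps straight out of the file. The trace is gzipped JSON in the Chrome trace-event style; the exact field names ("name", "dur" in microseconds) are an assumption here, and the sample trace below is generated inline so the pipeline runs anywhere:

```shell
# Sketch: summarize the longest steps in a build trace. The real file is
# out/build.trace.gz; we generate a tiny stand-in with the assumed layout.
printf '%s' '[{"name":"kati","dur":150000000},{"name":"soong_build","dur":60000000},{"name":"ninja","dur":9000000}]' | gzip > build.trace.gz

# Decompress and print the five longest events, converting µs to seconds.
gunzip -c build.trace.gz | python3 -c '
import json, sys
events = json.load(sys.stdin)
for e in sorted(events, key=lambda e: -e.get("dur", 0))[:5]:
    print("%8.1fs  %s" % (e["dur"] / 1e6, e.get("name", "?")))
'
```

On a real trace, point the pipeline at out/build.trace.gz instead of the generated sample.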

09:47.680 --> 09:55.360
So this red one is processing the blueprint files, this one the makefiles, which on this

09:55.440 --> 10:02.720
machine took two and a half minutes, and the rest is Ninja compiling all the little bits and pieces

10:03.760 --> 10:12.880
that you actually need for the system. And before we dive into caching, some quick

10:12.880 --> 10:18.800
tips for how to make it run faster. So obviously, nobody has unlimited budgets. We have to make

10:18.800 --> 10:25.840
some trade-offs. So what's a good trade-off in terms of hardware? For the CPU, more cores is always

10:25.840 --> 10:33.520
better. You can get so much parallelism out of that main Ninja step that if you can get more cores,

10:33.520 --> 10:42.560
get more cores. 24 or 32 is a good baseline, but 64 or 128 is just going to make it go so much faster.

10:43.280 --> 10:49.760
More recent architectures are usually better, like they're always better, but it's not as much

10:49.760 --> 10:58.160
of a win as just raw parallelism. RAM, you want about two gigabytes per logical core. So if you have

10:58.160 --> 11:05.680
32 cores, you want 64 gigs of RAM, but that's typically the minimum. There are some spikes, so you

11:05.760 --> 11:13.200
do want to have a swap roughly the same size, just to be able to absorb those spikes every now and

11:13.200 --> 11:18.640
then; otherwise, the OOM killer might decide to kill your build. But the most important one is

11:18.640 --> 11:27.200
storage. Low latency is the most important aspect in a build. It doesn't matter how fast the

11:27.200 --> 11:33.600
transfer speed is; latency is king. And the reason is that the working directory has about four

11:33.600 --> 11:41.520
million files and the median size is about 300 bytes. So these are millions and millions of

11:42.560 --> 11:46.880
teeny tiny files and you just need to be able to create them on the file system as quickly as you can.

11:47.760 --> 11:54.240
It's the same with the soong_build phase. It keeps scanning the codebase

11:55.440 --> 11:59.760
and scans about one million paths, but only reads a handful of those files.

12:00.640 --> 12:05.840
So again, just how quickly your SSD can respond is the most important aspect here.
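
If you want a feel for this on your own disk, here's a toy benchmark (my own sketch, not an Android tool): time how long it takes to create a pile of tiny files, which approximates the build's metadata-heavy write pattern.

```shell
# Toy I/O benchmark: create 10,000 tiny files and time it. Run it on the
# disk you plan to build on; /tmp here is just an example location.
dir=/tmp/latency-demo
rm -rf "$dir" && mkdir -p "$dir"
start=$(date +%s)
i=0
while [ "$i" -lt 10000 ]; do echo x > "$dir/f$i"; i=$((i+1)); done
echo "created $(ls "$dir" | wc -l) files in $(( $(date +%s) - start ))s"
```

A fast NVMe drive should finish this in well under a second; a network filesystem will be dramatically slower, which is exactly the effect that hurts Android builds.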

12:07.760 --> 12:15.600
Right. So back to caching, and recouping some of the features that Bazel has but Soong doesn't.

12:16.880 --> 12:22.400
So if you just want a local cache, there's the good old ccache. Everybody's heard of that.

12:22.720 --> 12:31.680
And Android has supported it pretty much since the very beginning. At some point there was a prebuilt

12:31.680 --> 12:38.080
binary in the code base. That's now gone, but you can still bring your own. If you just install

12:38.080 --> 12:43.200
ccache on Ubuntu and point Android at that binary, it's going to work just fine.

12:44.320 --> 12:49.360
The way you set it up is that you need these three environment variables.
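
Concretely, the setup looks something like this. The paths are examples; point CCACHE_EXEC at wherever you installed ccache and CCACHE_DIR at wherever you want the cache to live:

```shell
# Enable ccache for an AOSP build (example paths, adjust to your machine).
export USE_CCACHE=1                  # tell the build to use ccache
export CCACHE_EXEC=/usr/bin/ccache   # the ccache binary you installed
export CCACHE_DIR=/ssd/ccache        # where the cached objects are stored

ccache -M 20G                        # cap the cache; 20 GB is enough for AOSP 16
m                                    # build as usual
ccache -s                            # print hit/miss statistics afterwards
```

This is a configuration fragment meant to run inside an Android build environment, not a standalone script.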

12:50.240 --> 12:56.800
USE_CCACHE=1 just enables it. CCACHE_DIR is the place where you store the data,

12:57.520 --> 13:04.880
and CCACHE_EXEC is the path to the ccache binary itself. And then ccache -M

13:04.880 --> 13:15.920
sets the maximum size of the cache; 20 gigs is enough for AOSP 16. Then you just run m to build

13:16.560 --> 13:22.320
Android and at the end it's going to tell you how long it took. If you remember in the beginning

13:22.320 --> 13:29.760
I said the baseline was two hours; it's about 1 hour 57. In this case, with ccache fully populated,

13:29.760 --> 13:38.240
it finished in 47 minutes. So we spend 20 gigs of local storage and get a 58% improvement in

13:38.240 --> 13:45.120
build times. And if we print the statistics from ccache, it tells us that it had cache hits on

13:46.000 --> 13:57.360
126,000 targets out of a total of 244,000. So it's pretty much exactly one half of the whole

13:57.360 --> 14:05.040
build is C and C++ that can be cached with ccache. So this is a really low-effort thing that you can

14:05.120 --> 14:17.120
do and it's going to make a huge difference. But the state of the art in AOSP in terms

14:17.120 --> 14:25.360
of build caching is a component called reclient. This is similar to ccache in the sense that it's

14:25.440 --> 14:32.800
a wrapper around a compiler, but in this case it connects to Bazel's remote build execution (RBE) infrastructure.

14:34.560 --> 14:40.640
And compared to ccache, which is only for C and C++, reclient has support for a lot more

14:40.640 --> 14:48.800
different kinds of compilers and tools. So there's clang, the linker, javac, metalava,

14:49.040 --> 14:58.160
R8, D8, signapk, and a bunch of smaller tools. And reclient can take those and make them

14:58.160 --> 15:06.560
cacheable. For this you need an RBE backend. The backend offers two kinds of services. The first one

15:06.560 --> 15:13.680
is that it's a distributed cache. So clients can upload results of their computation. And then

15:13.680 --> 15:20.160
either they themselves later or somebody else can download the same result later on. And in a

15:20.160 --> 15:25.600
larger organization there's also support for remote execution. So you might have a build farm and

15:25.600 --> 15:33.520
you can submit those individual jobs to the build farm and get the results back. There are a lot of

15:33.520 --> 15:40.720
different backend solutions; this is a standard protocol. The open-source ones all have very similar

15:41.680 --> 15:50.160
names: Buildbarn, Buildfarm, and BuildGrid. They're all under Apache 2.0. And if you go to the official

15:50.160 --> 15:58.400
Bazel website, there's a list of other ones, commercial ones, and so on. So how does

15:58.400 --> 16:07.600
reclient work? It has this command called rewrapper. You prefix your build commands with it, and

16:07.600 --> 16:15.200
Android does that for you. And rewrapper's job is to create this data structure called an Action.

16:15.200 --> 16:21.280
It's part of the standard protocol, and it describes what is about to be executed.

16:22.000 --> 16:28.800
But here we hit the difference between Bazel and Soong. Because in Bazel you have all the information

16:29.600 --> 16:36.880
available upfront. It's all statically declared and you can find it immediately. Whereas rewrapper needs

16:36.960 --> 16:45.120
to do a bit of work to fill in the gaps. The command and environment are statically available;

16:46.240 --> 16:53.360
by parsing the command line that's being executed, it can figure out which compiler is being used

16:53.360 --> 16:59.200
and what the main source files are. So it can populate some information about the inputs. But for

16:59.280 --> 17:07.760
example, to figure out which header files will be included by the compiler later on, it needs to run

17:07.760 --> 17:14.160
the preprocessor. So it's a bit of work to get to that information. And finally those input files

17:14.160 --> 17:20.640
need a hash. So it needs to hash those files before it can talk to the backend.

17:21.280 --> 17:27.280
Once it has the Action struct, it sends it to the backend. The backend, depending on how it's

17:27.360 --> 17:33.520
configured, will either execute there or look in the cache. And it gives back another data structure

17:33.520 --> 17:40.640
called ActionResult. That's the same idea, but it describes the outputs of that particular action.
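
To make that exchange concrete, here's a toy model of a content-addressed cache in plain shell. Nothing here is reclient-specific: the "compiler" is a stand-in tr command, the cache is just a directory, and the key is a hash over the inputs plus the command, mirroring how an Action is matched to an ActionResult.

```shell
# Toy content-addressed cache: key = hash(inputs + command); on a hit we
# reuse the stored artifact, on a miss we run the command and store it.
rm -rf /tmp/toy-cache && mkdir -p /tmp/toy-cache
echo 'hello world' > input.txt
cmd='tr a-z A-Z < input.txt > output.txt'      # stand-in for a compiler step

key=$( { cat input.txt; echo "$cmd"; } | sha256sum | cut -d' ' -f1)
if [ -f "/tmp/toy-cache/$key" ]; then
    cp "/tmp/toy-cache/$key" output.txt        # the "ActionResult" from cache
    echo "cache hit"
else
    sh -c "$cmd"                               # execute the "action" locally
    cp output.txt "/tmp/toy-cache/$key"        # upload the result for next time
    echo "cache miss"
fi
```

The first run reports a miss and populates the cache; running the same block again hits the cache and skips the command entirely.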

17:40.960 --> 17:46.720
And it's just this exchange: you do it a few hundred thousand times per build, and that's how

17:46.720 --> 17:52.400
the caching works. So, to set this up on your own machine,

17:53.200 --> 18:02.400
first you need an RBE backend. I'm not going to go through that here; it depends on the particular

18:02.400 --> 18:07.840
project, and it's quite well documented. But on the client side, in Android, you need to create

18:07.840 --> 18:18.160
a config file. There's an example in build/soong/docs/rbe.json (really well hidden), and you need to

18:18.240 --> 18:24.560
populate things like the IP address and credentials. Then you select the toolchains you want to use this for,

18:25.520 --> 18:30.080
and unfortunately I haven't been able to find one place where they would just have a list.

18:30.720 --> 18:37.200
But if you go to Code Search (cs.android.com) and just search for RBE_, you'll find all the

18:37.200 --> 18:44.480
strings. And finally you need to select the exec strategy, so what happens on a cache

18:45.440 --> 18:52.960
miss, and there are three options: remote_local_fallback, which executes remotely and falls back to local if it fails;

18:54.080 --> 18:59.040
local, the one I'm using in the example, which is no remote execution, just treating the backend as a

18:59.040 --> 19:05.840
distributed cache and nothing else; and finally my favorite, racing, which means you tell the remote

19:05.840 --> 19:11.840
server to execute, but you also run it locally, and whichever finishes first is the result that gets used.
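
Putting the client side together, here's a hypothetical sketch. The RBE_* keys and the ANDROID_BUILD_ENVIRONMENT_CONFIG* variables mirror what's in the AOSP tree, but treat the exact schema and names as assumptions; start from build/soong/docs/rbe.json and the strings on cs.android.com rather than copying this verbatim.

```shell
# Hypothetical reclient client-side config; key names are illustrative.
mkdir -p ~/aosp-rbe
cat > ~/aosp-rbe/myrbe.json <<'EOF'
{
  "env": {
    "USE_RBE": "true",
    "RBE_service": "rbe.example.com:443",
    "RBE_CXX_EXEC_STRATEGY": "racing",
    "RBE_JAVAC_EXEC_STRATEGY": "local"
  }
}
EOF

# Point the Android build at the config (directory + file name minus .json).
export ANDROID_BUILD_ENVIRONMENT_CONFIG_DIR=~/aosp-rbe
export ANDROID_BUILD_ENVIRONMENT_CONFIG=myrbe
m
```

Like the ccache snippet earlier, this is a configuration fragment meant for an Android build environment, not a standalone script.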

19:11.920 --> 19:18.960
And once you have this config file then you just need to again set some environment variables

19:18.960 --> 19:26.000
to point Android at that config file. The way this looks is two variables: the first one

19:26.000 --> 19:33.840
is the directory, the other one is the name of the file minus the .json, for some reason. And with that you can

19:33.840 --> 19:38.960
run m. The first time you need to populate the cache; the second time it's going to be fully populated,

19:39.440 --> 19:50.320
and it prints the stats at the end. So from this we can see that there were 156,900 local executions; that's

19:50.320 --> 19:59.680
the number of times the wrapper was invoked, and of those, 144,700 were cache hits. And that's out of

19:59.680 --> 20:11.200
247,000 targets (it for some reason adds 3,000 targets). So this is 64% coverage of the build, and the build

20:11.200 --> 20:19.840
completed in 30 minutes 25 seconds, which is 73% faster than the baseline. And this is a clear

20:19.840 --> 20:26.400
improvement over ccache, but to me it's actually quite disappointing, because we've gone through all this

20:26.400 --> 20:32.960
work, you know, a standard protocol, a distributed cache, support for ten different toolchains, and all we

20:32.960 --> 20:42.080
get out of it is another 14% of coverage and another 15% speedup. So it kind of shows you that there's

20:42.080 --> 20:50.320
a limit to how much you can do with compiler wrapping and this general approach of retrofitting the

20:50.320 --> 20:59.760
RBE protocol onto AOSP. But this is the state of the art; this is the best you can do with

20:59.760 --> 21:08.320
AOSP today. So given that, what's next? What can we actually do better here? Well, ideally we'd get better

21:08.320 --> 21:16.960
coverage and less local preprocessing, but that's tricky, because this is a heterogeneous codebase:

21:17.040 --> 21:23.920
every project is a little different, and it has a million different programming languages. This is

21:23.920 --> 21:29.200
probably not unique to Android; Yocto is very similar in this regard. So we'd want some sort of generic

21:29.200 --> 21:39.040
mechanism for isolating and monitoring those individual build commands, so that we can

21:39.040 --> 21:46.880
move them around, cache them, and find the inputs and outputs. It would also be nice to speed up the

21:47.200 --> 21:53.600
Ninja file generation. Like we said, it's somewhere between two and seven minutes, and you run into it

21:53.920 --> 22:02.720
every now and then. Ideally the algorithm itself would be faster, and anecdotally I feel like in the

22:02.720 --> 22:10.480
past couple of releases it's gotten a little better, but it's also quite RAM-intensive, and

22:10.480 --> 22:16.560
so there's probably a limit to how much you can do there. The other option here is

22:16.640 --> 22:23.040
extending caching to this early stage, to cache the result of soong_build and Kati, so that

22:24.160 --> 22:28.480
if you're in an organization where somebody else has gone through this, or if you're switching between branches

22:28.480 --> 22:36.080
and you have a local cache, you don't have to go through this every single time. And you will have

22:36.080 --> 22:41.920
noticed that we haven't really talked about checkouts, and that's also still a problem, especially

22:41.920 --> 22:51.040
in CI where you don't have that persistent state or if you're an engineer who keeps switching between

22:51.040 --> 22:56.640
Android branches because you need to cherry-pick fixes into different releases, and so on.

22:58.080 --> 23:03.200
In that setup you keep going through that checkout over and over, and it's also quite painful.

23:03.440 --> 23:15.200
But again, going back to Stefan's talk, about 50% of the disk usage is coming from prebuilts,

23:15.840 --> 23:22.640
and prebuilts come in all sorts of versions (of clang, for example), and they're built for Linux and Darwin,

23:22.640 --> 23:28.640
when really what you need during the build is one version for one architecture. So a lot of the

23:29.360 --> 23:35.280
time during the checkout is actually completely wasted. And even in the source code, you don't always

23:35.280 --> 23:41.280
need everything: you don't need all the definitions for all the devices, you don't need CTS when you're

23:41.280 --> 23:47.440
just building the operating system itself. So there's a lot that we're downloading for

23:47.840 --> 23:56.400
no good reason. There actually used to be a solution for this: GVFS is a Microsoft project

23:56.400 --> 24:04.640
for a virtual file system for Git, which would allow us to do on-demand downloading of

24:04.640 --> 24:08.720
these files. But unfortunately the project has been deprecated, so that's

24:09.520 --> 24:16.560
also kind of stuck in that sense. And that's the end of my talk. These are the questions that

24:16.560 --> 24:22.320
we're asking ourselves. And we have an announcement: if you want to see

24:22.400 --> 24:28.560
a demo of how we do checkouts in 30 seconds and builds in five minutes, we have a video for you.

24:31.280 --> 24:38.160
And yeah, if you're interested in what we're doing, we're hiring, so there's an email.

24:40.160 --> 24:42.560
And yeah, that's the end. Thank you.

24:42.960 --> 24:54.960
Do we have any questions? Yeah, go ahead.

24:54.960 --> 25:08.560
I have heard of it; I don't think I'm allowed to talk about that, sorry. Go ahead.

25:08.960 --> 25:17.360
Are there other questions? Yes: have there been problems when you have one version of a file in the cache and you're

25:17.360 --> 25:21.760
building, using the cache, a newer version, and something breaks in your

25:21.760 --> 25:28.960
build? So, have I had that experience... sorry, I don't think I've understood.

25:29.920 --> 25:36.320
You mentioned that Soong doesn't properly track all the inputs, right? Everything that

25:36.320 --> 25:43.520
influences the content. So have you never had breakages using the new cache, where you have results

25:43.520 --> 25:47.840
cached based on the wrong inputs, but you're building newer source? Okay, so I'll just

25:47.840 --> 25:54.480
repeat the question: it was, does reclient ever run into the issue of not tracking the dependencies

25:54.560 --> 26:01.120
properly. And I have not seen that, and that's because it's spending that effort

26:01.120 --> 26:08.640
trying to preprocess the local inputs, to really figure out what the compiler

26:08.640 --> 26:15.840
will need later on. Are you only using a pure Google baseline when you're doing the testing?

26:15.920 --> 26:26.000
Yes, in this case pure AOSP. Yeah, I mean, if you look at

26:26.000 --> 26:33.680
reclient, it was built originally for Chromium and then later extended for AOSP, but it's literally AOSP:

26:33.680 --> 26:40.560
if you have a vendor codebase which uses a compiler flag that isn't used in AOSP,

26:41.280 --> 26:47.520
then it will not know how it works and whether it affects the build, so it would just say, sorry,

26:47.520 --> 26:57.520
I can't cache this. ...Time's up? All right, thank you so much.

