WEBVTT

00:00.000 --> 00:14.000
I'm going to talk about Btrfs today. So, two questions for you.

00:14.000 --> 00:17.000
Who here has a laptop with them, just right now.

00:17.000 --> 00:18.000
Show of hands.

00:18.000 --> 00:21.000
Okay, leave your hands up.

00:21.000 --> 00:26.000
If you know what the file system is on your laptop right now.

00:26.000 --> 00:33.000
I don't think anyone put their hands down. So, my friends.

00:33.000 --> 00:37.000
So here's the very brief outline for the talk.

00:37.000 --> 00:48.000
I'm going to talk about how my company cut our storage costs by 74% over the last 18 months by onboarding Btrfs.

00:48.000 --> 00:55.000
And how we productionized it in the company and on Google Cloud Platform.

00:55.000 --> 00:58.000
Just a little bit of business context.

00:58.000 --> 01:03.000
Chronosphere is a company selling observability.

01:03.000 --> 01:04.000
Software as a service.

01:04.000 --> 01:10.000
It's observability software as a service, which is selling, among other things, telemetry.

01:10.000 --> 01:11.000
So metrics.

01:11.000 --> 01:16.000
We do store a lot of time series data.

01:16.000 --> 01:26.000
And we have some big customers where we run thousands of database servers per customer.

01:26.000 --> 01:34.000
And each disk of the database server is at least one terabyte in size.

01:34.000 --> 01:43.000
They're usually more than a terabyte, and that adds up to some petabytes stored in Google Cloud Platform.

01:43.000 --> 01:50.000
In mid-2025, all of it was on ext4.

01:50.000 --> 01:56.000
And then at some point we started thinking: well, what does Btrfs have to offer here?

01:56.000 --> 01:58.000
One: its compression.

01:58.000 --> 02:01.000
It can save a lot of disk space.

02:01.000 --> 02:06.000
And compression is fully transparent as I'll show you in the demo soon.

02:06.000 --> 02:12.000
And checksums in Btrfs can save on CPU cycles and application complexity as well.

02:12.000 --> 02:22.000
Just to ground this in reality a bit, I'll show you a demo of inline Btrfs compression on my laptop.

02:22.000 --> 02:37.000
So I have sacrificed 256 of my disk, a partition, for Btrfs.

02:37.000 --> 02:47.000
So there is a file system there, I'll just mount it.

02:47.000 --> 02:54.000
Right, so here, oh, I used the wrong arguments.

02:54.000 --> 03:05.000
You'll just see why.

03:05.000 --> 03:17.000
So I am mounting the Btrfs partition on a directory, and I'm telling Btrfs to compress it with the zstd compression algorithm.

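For reference, the mount step in the demo looks roughly like this; the device and mount point names are made up for illustration:

```shell
# Mount a Btrfs partition with transparent zstd compression enabled.
# /dev/nvme0n1p5 and /mnt/demo are hypothetical names.
sudo mount -o compress=zstd /dev/nvme0n1p5 /mnt/demo

# Confirm the options actually in effect; on recent kernels this
# typically shows compress=zstd:3 (level 3 is the default).
findmnt -no OPTIONS /mnt/demo
```

The same option can be made permanent via the corresponding fstab entry.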
03:17.000 --> 03:26.000
And I have a dump of English Wikipedia.

03:26.000 --> 03:35.000
The first billion bytes of an English Wikipedia snapshot, from about 20 years ago to be pedantic.

03:35.000 --> 03:44.000
It's one gigabyte in SI units; in binary units that would be 954 mebibytes.

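The unit conversion is easy to verify: 10^9 bytes divided by 2^20 bytes per mebibyte is about 954.

```shell
# 10^9 bytes ("1 GB" in SI units) expressed in binary mebibytes.
bytes=1000000000
echo $(( bytes / 1048576 ))                              # whole MiB: 953
awk "BEGIN { printf \"%.1f MiB\n\", $bytes / 1048576 }"  # 953.7 MiB
```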
03:44.000 --> 03:49.000
So what does this English Wikipedia look like?

03:49.000 --> 03:52.000
It is an XML document, so a lot of text.

03:52.000 --> 03:58.000
There is useful stuff in this XML document.

03:58.000 --> 04:04.000
Let's try to search for a string.

04:04.000 --> 04:12.000
We can see the search works, and who knew until today that FOSDEM is the Free and Open source Software Developers' European Meeting.

04:12.000 --> 04:14.000
I kind of knew, but now I do.

04:14.000 --> 04:17.000
Well, it's from Wikipedia from 20 years ago.

04:17.000 --> 04:25.000
So, important things here: we can see that the file is almost a gigabyte, like it's a gig of data.

04:25.000 --> 04:31.000
We can read it with cat, grep, etc., and we can open it with a text editor.

04:31.000 --> 04:38.000
You know, just a regular file. But what's interesting about this file is that it is stored compressed.

04:38.000 --> 04:45.000
So we can see that it uses 340 megs on disk.

04:45.000 --> 04:47.000
So about a third.

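You can compare the apparent file size with what is actually allocated on a Btrfs mount; `compsize` (from the btrfs-compsize package) gives a per-algorithm breakdown. The file and mount names here are illustrative, assuming the dump is the well-known enwik9 file (the first 10^9 bytes of an English Wikipedia dump):

```shell
# Apparent size vs. space actually allocated on disk.
ls -lh /mnt/demo/enwik9     # apparent size, roughly 954M
du -h  /mnt/demo/enwik9     # allocated size, roughly 340M with zstd

# Detailed breakdown: compression type, ratio, referenced vs. on-disk bytes.
sudo compsize /mnt/demo/enwik9
```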
04:47.000 --> 04:51.000
Imagine this was a 1-terabyte or 2-terabyte disk.

04:51.000 --> 05:01.000
If I only stored English Wikipedia, or just Wikipedia, I would be able to store 3x as much and would not have to change anything else.

05:01.000 --> 05:03.000
So it's transparent compression.

05:03.000 --> 05:09.000
The applications do not see it; they don't need to uncompress it or compress it.

05:09.000 --> 05:21.000
For a regular desktop, since all of my laptop is on Btrfs, we can see that it is compressing not only text data,

05:21.000 --> 05:23.000
but things like program data.

05:23.000 --> 05:25.000
So we will get back to this in a minute or so.

05:25.000 --> 05:31.000
So I just showed you the demo of transparent compression on Wikipedia.

05:31.000 --> 05:33.000
It was around 60%.

05:33.000 --> 05:35.000
Yeah, around 60% compression.

05:35.000 --> 05:39.000
Actually, it was 65%.

05:39.000 --> 05:42.000
It uses 35% of the space.

05:42.000 --> 05:44.000
So that's 65% compression.

05:44.000 --> 05:46.000
So we used a third of the disk.

05:46.000 --> 05:50.000
So this is exactly what we saw when we looked at Btrfs over a year ago.

05:50.000 --> 05:56.000
Our initial tests with the production data at Chronosphere, with the telemetry production data,

05:56.000 --> 06:02.000
showed that if we just moved the data from ext4 to Btrfs with transparent compression,

06:03.000 --> 06:05.000
we cut the disk usage by a third.

06:05.000 --> 06:07.000
Sorry, by two thirds.

06:07.000 --> 06:11.000
So we are down by 65%.

06:11.000 --> 06:13.000
This was huge.

06:13.000 --> 06:21.000
When it's multiplied by a lot of petabytes, those could be huge savings.

06:21.000 --> 06:28.000
But what immediately jumped into my mind was the bad reputation from the past.

06:28.000 --> 06:36.000
If you saw my speaker bio: I tried to use Btrfs 10 or 15 years ago, and it ate the data.

06:36.000 --> 06:39.000
And I still remember that.

06:39.000 --> 06:44.000
But I have also used Btrfs recently, and it didn't.

06:44.000 --> 06:48.000
So the first question is, okay, the savings are significant.

06:48.000 --> 06:52.000
They're interesting, but the reputation problem is there.

06:52.000 --> 06:56.000
And if you look on the internet even today, there are people reporting data loss,

06:56.000 --> 07:00.000
or just unable to configure it, or other problems.

07:00.000 --> 07:02.000
It's all over the place.

07:02.000 --> 07:12.000
But there was one specific talk at Open Source Summit 2020, which was documented by Linux Weekly News,

07:12.000 --> 07:19.000
by Josef Bacik, a kernel developer from Facebook working on Btrfs.

07:19.000 --> 07:25.000
And he was explaining how Facebook maintains and uses Btrfs on their millions of servers.

07:25.000 --> 07:31.000
And this talk was really a confidence booster that this is something that could actually work outside of my laptop,

07:31.000 --> 07:34.000
you know, broader for the company.

07:34.000 --> 07:42.000
So with the projected possible savings and this anecdote and the talk from Facebook,

07:42.000 --> 07:47.000
we decided to have a better look, to actually evaluate it.

07:47.000 --> 07:50.000
See if it could work and reduce our bill.

07:51.000 --> 07:58.000
So once we started looking at that, there were a few known unknowns that we knew we'd have to work through.

07:58.000 --> 08:02.000
First, confidence and reputation; I just talked about it.

08:02.000 --> 08:08.000
The talk mostly mitigated that when talking to my peers in the company.

08:08.000 --> 08:10.000
Btrfs is infamous.

08:10.000 --> 08:14.000
The other thing is that Btrfs is infamous for how it reports disk space.

08:14.000 --> 08:17.000
How much free space is actually remaining?

08:17.000 --> 08:23.000
Because it manages and lays out files differently than other file systems.

08:23.000 --> 08:28.000
But this can be mitigated and it's pretty well documented.

08:28.000 --> 08:33.000
If you're interested, you can look up these terms: background reclaim threshold (bg_reclaim_threshold) and dynamic reclaim.

08:33.000 --> 08:34.000
It's fine.

08:34.000 --> 08:35.000
You have to know what you're doing.

08:35.000 --> 08:36.000
You have to do it differently.

08:36.000 --> 08:40.000
But you can know how much space you actually have remaining.

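A sketch of what that looks like in practice; the mount point is illustrative and the UUID path component is per-filesystem:

```shell
# 'df' can be misleading on Btrfs; the native command shows allocated
# vs. used space per chunk type (Data, Metadata, System).
sudo btrfs filesystem usage /mnt/demo

# The reclaim knob mentioned above lives in sysfs, per filesystem UUID:
cat /sys/fs/btrfs/<UUID>/allocation/data/bg_reclaim_threshold
# Setting it to a percentage makes Btrfs reclaim nearly-empty block
# groups in the background once usage crosses that threshold.
echo 75 | sudo tee /sys/fs/btrfs/<UUID>/allocation/data/bg_reclaim_threshold
```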
08:41.000 --> 08:46.000
Also, there was no mention of Btrfs at Google, and Chronosphere runs on Google.

08:46.000 --> 08:51.000
This is something that needed to be looked at.

08:51.000 --> 08:54.000
It could be that it actually did not work.

08:54.000 --> 08:58.000
Somebody tried and said, no, this is not working on our offering.

08:58.000 --> 09:02.000
Or nobody implemented it; nobody bothered.

09:03.000 --> 09:10.000
Another thing: we'd been using ext4 for a decade on this database, and we knew how it behaved.

09:10.000 --> 09:12.000
We knew it performed well enough.

09:12.000 --> 09:19.000
Since Btrfs is completely different, we didn't know how it would behave under a database workload.

09:19.000 --> 09:23.000
And there were issues; we'll get into that.

09:23.000 --> 09:27.000
And we knew that there are unknown unknowns.

09:27.000 --> 09:31.000
Things that we didn't know would break before we started the project.

09:31.000 --> 09:34.000
And that happened, too.

09:34.000 --> 09:37.000
So, I told you about the known unknowns.

09:37.000 --> 09:39.000
We figured them out.

09:39.000 --> 09:44.000
And now I'll give you the unknown unknowns that I think are the most interesting ones.

09:44.000 --> 09:49.000
The title of the first surprise is reclaim.

09:49.000 --> 09:52.000
The Btrfs term for defragmentation.

09:52.000 --> 09:56.000
It's causing a massive I/O spike on large deletes.

09:56.000 --> 10:04.000
A colleague of mine gave permission to share this graph from a production outage a few months ago.

10:04.000 --> 10:07.000
And this is a commit log queue size.

10:07.000 --> 10:09.000
You can see it grew a lot.

10:09.000 --> 10:13.000
And this was a customer-visible outage.

10:13.000 --> 10:17.000
And what does this show? One of the database clusters.

10:17.000 --> 10:22.000
One line is one database server. It was overloaded.

10:22.000 --> 10:25.000
It was not able to flush writes on time.

10:25.000 --> 10:26.000
So it was not accepting.

10:26.000 --> 10:28.000
It was delaying writes.

10:28.000 --> 10:37.000
And the problem here, the root cause for this, is readahead.

10:37.000 --> 10:38.000
Right.

10:38.000 --> 10:39.000
Yeah.

10:39.000 --> 10:42.000
No, for this last one, that was not readahead.

10:42.000 --> 10:50.000
That was balance and reclaim, which I talked about before, and for which we have toggles now.

10:50.000 --> 10:53.000
This other one is readahead.

10:53.000 --> 11:00.000
So this is a screenshot from another outage, which I don't have chat sessions for, but we have database graphs for.

11:00.000 --> 11:07.000
The graph on the right is the P99 database latency, which is usually a few milliseconds.

11:07.000 --> 11:10.000
And at some point, it spiked to multiple seconds.

11:10.000 --> 11:13.000
So this was user visible query degradation.

11:13.000 --> 11:16.000
Some of their queries were not going through.

11:16.000 --> 11:22.000
And if we looked at the disk graphs, we could see that the disk throughput,

11:23.000 --> 11:26.000
was flattened out at the top.

11:26.000 --> 11:30.000
So the disk was overloaded.

11:30.000 --> 11:32.000
And why is that?

11:32.000 --> 11:39.000
There are two readahead settings on Linux, conveniently named the same way.

11:39.000 --> 11:45.000
Can anyone here tell the difference?

11:45.000 --> 11:46.000
No hands?

11:46.000 --> 11:47.000
Oh, there are hands?

11:47.000 --> 11:50.000
Yeah, shout it out, I'll repeat it.

11:50.000 --> 11:54.000
Yeah, they said they're in different places.

11:54.000 --> 11:55.000
Yes.

11:55.000 --> 12:00.000
So the difference is 32x.

12:00.000 --> 12:02.000
So the first one is block level.

12:02.000 --> 12:05.000
So if a kernel sees that you're reading a block,

12:05.000 --> 12:09.000
it will prefetch 128

12:09.000 --> 12:14.000
kibibytes from the same area, you know, from the disk.

12:14.000 --> 12:16.000
But Btrfs is smarter.

12:16.000 --> 12:19.000
If you start reading a large file,

12:19.000 --> 12:23.000
it will prefetch up to 4 mebibytes, 4 megs,

12:23.000 --> 12:25.000
from the point that you're reading.

12:25.000 --> 12:30.000
So this alone could cause 32x read amplification.

12:30.000 --> 12:33.000
So you're reading, even though you may not be using it,

12:33.000 --> 12:35.000
but it will still fetch it from the disk.

12:35.000 --> 12:37.000
And this was exactly what we saw in the graph.

12:37.000 --> 12:41.000
It was pre-fetching a lot of stuff for the files that it didn't actually use.

12:41.000 --> 12:44.000
So once we saw this, and this is a configurable toggle,

12:44.000 --> 12:48.000
we moved the 4 mebibytes down to 128 kibibytes and all was well.

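Both knobs are plain sysfs files; a sketch of how such a fix could look, with the device and BDI instance names chosen for illustration:

```shell
# Block-device readahead, in KiB; the kernel default is usually 128.
cat /sys/block/nvme0n1/queue/read_ahead_kb

# Btrfs does its readahead through its backing-dev-info (BDI) instance,
# which has its own read_ahead_kb; lowering it caps the ~4 MiB prefetch.
cat /sys/class/bdi/btrfs-1/read_ahead_kb
echo 128 | sudo tee /sys/class/bdi/btrfs-1/read_ahead_kb
```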
12:48.000 --> 12:51.000
I'll skip the timeline.

12:51.000 --> 12:53.000
So we started the project in June,

12:53.000 --> 12:56.000
and we finished it at the end of last year.

12:56.000 --> 12:58.000
And takeaways.

12:58.000 --> 13:01.000
Our disk savings,

13:01.000 --> 13:04.000
instead of the predicted 65%,

13:04.000 --> 13:08.000
were 74%. Btrfs is now available

13:08.000 --> 13:14.000
upstream in Google Cloud Platform, in the infrastructure

13:14.000 --> 13:17.000
that runs Google managed Kubernetes.

13:17.000 --> 13:21.000
We're planning, we as Chronosphere are planning, to go fully upstream

13:21.000 --> 13:23.000
in the next one or two months,

13:23.000 --> 13:25.000
and, you know, besides Facebook,

13:25.000 --> 13:28.000
who are very open about their Btrfs use,

13:28.000 --> 13:31.000
Now you have another company to refer to

13:31.000 --> 13:35.000
that is using Btrfs in production with quite a lot of data.

13:35.000 --> 13:40.000
Special thanks to Boris Burkov and Josef Bacik

13:40.000 --> 13:42.000
for the reclaim work, for managing the reclaim,

13:42.000 --> 13:45.000
making our lives managing the cluster easier,

13:45.000 --> 13:47.000
and for spreading the word.

13:47.000 --> 13:49.000
The talk was a game changer.

13:49.000 --> 13:50.000
The talk that I mentioned before

13:50.000 --> 13:52.000
was a game changer for adopting it.

13:52.000 --> 13:55.000
Thanks to all of the Btrfs maintainers,

13:55.000 --> 13:57.000
and Google Cloud Platform,

13:57.000 --> 13:59.000
internal champions,

13:59.000 --> 14:03.000
who championed Btrfs and made it work on their end.

14:03.000 --> 14:06.000
That's all from me.

14:07.000 --> 14:09.000
Thank you.

14:18.000 --> 14:21.000
Yeah, we still have a few minutes for questions.

14:30.000 --> 14:32.000
Last time I checked, Btrfs

14:32.000 --> 14:36.000
usually got a lot worse performance numbers

14:36.000 --> 14:37.000
in terms of throughput,

14:37.000 --> 14:38.000
compared to ext4.

14:38.000 --> 14:40.000
what's your experience with that?

14:40.000 --> 14:41.000
Yes.

14:41.000 --> 14:42.000
In our workload,

14:42.000 --> 14:43.000
it was really favorable,

14:43.000 --> 14:46.000
and there was no visible difference at all,

14:46.000 --> 14:50.000
in latency or throughput.

14:53.000 --> 14:55.000
And what's the compression on your Nix store,

14:55.000 --> 14:57.000
if you ran the command like...

14:57.000 --> 14:59.000
Oh yeah, yes.

15:03.000 --> 15:04.000
60%.

15:04.000 --> 15:06.000
So 40% savings.

15:13.000 --> 15:14.000
Hello.

15:14.000 --> 15:15.000
Yeah.

15:15.000 --> 15:16.000
Thank you for your talk.

15:16.000 --> 15:18.000
So I'm a big ZFS fan,

15:18.000 --> 15:20.000
and there's this thing called

15:20.000 --> 15:22.000
opportunistic compression,

15:22.000 --> 15:26.000
where it's not going to bother compressing the rest,

15:26.000 --> 15:28.000
while writing.

15:28.000 --> 15:30.000
If it doesn't detect any savings,

15:30.000 --> 15:32.000
I don't know if Btrfs does that as well.

15:32.000 --> 15:33.000
Yes.

15:33.000 --> 15:34.000
So it looks...

15:34.000 --> 15:35.000
The question...

15:35.000 --> 15:36.000
Yeah.

15:36.000 --> 15:37.000
It will look at the beginning of the file,

15:37.000 --> 15:38.000
and it will determine

15:38.000 --> 15:40.000
whether it's worth compressing.

15:40.000 --> 15:42.000
And it will either do it or not.

15:42.000 --> 15:44.000
There's another one.

15:44.000 --> 15:45.000
Behind you.

15:48.000 --> 15:50.000
Have you ventured into

15:50.000 --> 15:53.000
Btrfs RAID support, with RAID 5 and 6?

15:53.000 --> 15:56.000
We did not find a need for this,

15:56.000 --> 15:57.000
so I did not test it.

16:01.000 --> 16:02.000
Hi.

16:02.000 --> 16:05.000
How are you managing fallocated files

16:05.000 --> 16:07.000
that don't get compressed,

16:07.000 --> 16:09.000
usually created by databases

16:09.000 --> 16:12.000
and other software?

16:12.000 --> 16:14.000
So you asked about fallocate?

16:14.000 --> 16:15.000
Yeah.

16:15.000 --> 16:16.000
Yes, fallocate on Btrfs.

16:16.000 --> 16:17.000
So in our database,

16:17.000 --> 16:19.000
since I'm in the database team,

16:19.000 --> 16:23.000
we control the code that writes to the disk,

16:23.000 --> 16:25.000
so we don't fallocate there.

16:25.000 --> 16:27.000
That's it.

16:28.000 --> 16:30.000
I think you can also turn it off,

16:30.000 --> 16:32.000
but that could be...

16:34.000 --> 16:36.000
Yeah, that's it.

16:36.000 --> 16:39.000
What is the main reason why you

16:39.000 --> 16:41.000
chose Btrfs for servers,

16:41.000 --> 16:42.000
for server use,

16:42.000 --> 16:43.000
instead of ZFS,

16:43.000 --> 16:46.000
which is usually more popular in...

16:46.000 --> 16:48.000
It's much easier.

16:48.000 --> 16:49.000
Yes.

16:49.000 --> 16:50.000
The question is: why Btrfs

16:50.000 --> 16:51.000
and not ZFS?

16:51.000 --> 16:53.000
Basically upstream support.

16:53.000 --> 16:55.000
You get it in the kernel right there.

16:56.000 --> 16:59.000
There's a lot of tooling already in place for Btrfs,

16:59.000 --> 17:01.000
for those user space parts,

17:01.000 --> 17:03.000
like the container storage interface.

17:03.000 --> 17:06.000
It's a much tougher sell

17:06.000 --> 17:10.000
for a larger company to use ZFS

17:10.000 --> 17:13.000
because of the licensing and non-upstream status.

17:13.000 --> 17:14.000
With Btrfs,

17:14.000 --> 17:15.000
it's pretty straightforward if you can prove

17:15.000 --> 17:17.000
that it will work technically.

17:19.000 --> 17:23.000
Are you using the snapshot features of Btrfs at all?

17:24.000 --> 17:26.000
The snapshot and the revert features

17:26.000 --> 17:28.000
are some of the more interesting bits

17:28.000 --> 17:30.000
of normal system usage of it,

17:30.000 --> 17:32.000
or is it just a plain file system for you?

17:32.000 --> 17:33.000
Right now, no.

17:33.000 --> 17:34.000
That's right.

17:34.000 --> 17:35.000
Yes.

17:35.000 --> 17:37.000
We have just finished moving everything.

17:37.000 --> 17:39.000
So we're not using the advanced features.

17:39.000 --> 17:40.000
Just checksumming.

17:40.000 --> 17:42.000
We have disabled some of the application checksums,

17:42.000 --> 17:44.000
which we knew we could disable,

17:44.000 --> 17:46.000
and everything else is just a dumb database.

17:49.000 --> 17:51.000
So I'm wondering about Btrfs's

17:51.000 --> 17:53.000
copy-on-write behavior;

17:53.000 --> 17:57.000
it has a reputation for not working very well with databases,

17:57.000 --> 18:01.000
and I was wondering what database engine you use on Btrfs?

18:01.000 --> 18:05.000
So we use M3DB, which is,

18:05.000 --> 18:07.000
which was open sourced,

18:07.000 --> 18:11.000
which is our fork of open source M3DB.

18:11.000 --> 18:12.000
So we wrote it,

18:12.000 --> 18:15.000
and then because of the nature of the database,

18:15.000 --> 18:18.000
we don't do a lot of writing into the middle of files.

18:18.000 --> 18:21.000
That's why it's a really good fit.

18:21.000 --> 18:25.000
But I acknowledge it doesn't work for every database,

18:25.000 --> 18:27.000
but it just works really well for us.

18:29.000 --> 18:32.000
You said that the checksumming

18:32.000 --> 18:35.000
will simplify your applications.

18:35.000 --> 18:38.000
Yes, so we don't do it.

18:38.000 --> 18:41.000
That's it.

18:41.000 --> 18:45.000
In some cases, we don't have to do the checksumming,

18:46.000 --> 18:48.000
so it saves on the CPU.

18:48.000 --> 18:52.000
Well, the checksumming is still happening, just in the kernel,

18:52.000 --> 18:54.000
and it's not happening as frequently,

18:54.000 --> 18:57.000
because in our case it's easier to control.

18:57.000 --> 19:01.000
Like, we don't have to do as many checksums as we used to,

19:01.000 --> 19:04.000
because we know that what we wrote, we get back out.

19:04.000 --> 19:07.000
At least up until it goes out over the network.

19:08.000 --> 19:11.000
Are you planning to use the, the duplication?

19:11.000 --> 19:15.000
Duplication? What, can you elaborate?

19:15.000 --> 19:18.000
You know, the DUP feature.

19:18.000 --> 19:21.000
So for the metadata.

19:21.000 --> 19:24.000
Do you see an advantage for it?

19:24.000 --> 19:27.000
Yeah, in this case, with Google's persistent disks,

19:27.000 --> 19:29.000
there is no need to do duplicate anything,

19:29.000 --> 19:31.000
because they're reliable enough.

19:31.000 --> 19:33.000
If we ran on our own disks,

19:33.000 --> 19:36.000
then that would be a different thing.

19:37.000 --> 19:39.000
Hi, thank you for your talk.

19:39.000 --> 19:43.000
Um, do you use the send and receive features of betterFS?

19:43.000 --> 19:44.000
Right.

19:44.000 --> 19:46.000
I can't hear or see well from here.

19:46.000 --> 19:47.000
Sorry.

19:47.000 --> 19:49.000
Thanks for the talk.

19:49.000 --> 19:53.000
Do you use the send and receive features of Btrfs?

19:53.000 --> 19:55.000
We tried them, but for our case,

19:55.000 --> 19:58.000
the CPU overhead was prohibitive.

19:58.000 --> 20:00.000
It was faster to do something else.

20:00.000 --> 20:01.000
Like, it didn't help.

20:01.000 --> 20:05.000
Throughput increased, but it still took a lot of CPU.

20:06.000 --> 20:08.000
I'm getting signals to wrap up for the next speaker.


20:09.000 --> 20:10.000
Thank you very much.


