WEBVTT

00:00.000 --> 00:11.520
Okay, so welcome to my talk, called UltraFast Lua JSON Parsing.

00:11.520 --> 00:17.240
It's called UltraFast because, at the time when I was submitting the proposal, I still

00:17.240 --> 00:22.400
wasn't sure how fast it was going to be, so we will see.

00:22.400 --> 00:28.480
So yeah, a bit about me, and why I'm interested in Lua and JSON.

00:28.560 --> 00:36.320
I'm a software engineer at BeamNG. It's a video game, and unfortunately we have a closed-source

00:36.960 --> 00:44.400
engine, but the Lua code, which is most of the logic, is open source, though not with an open-source

00:44.400 --> 00:52.000
license; you have to buy it. And my focus there is on the Linux port and on security, so we are

00:52.240 --> 00:59.600
sandboxing some user code, and also a bit of hacking on LuaJIT, so if you want to talk about

00:59.600 --> 01:09.200
any of this, I would be happy to. So what is LuaJIT, if you haven't heard of it? It's a just-in-time

01:09.200 --> 01:16.640
compiler for the Lua programming language, and Lua is really commonly used in game development.

01:17.600 --> 01:24.240
LuaJIT also comes with a high-performance interpreter, and it's compatible with

01:24.240 --> 01:33.680
the Lua API. And why do we use it for the game? Because these Lua virtual machines are

01:33.680 --> 01:40.960
very small, and we run lots of them in threads: for every vehicle, we run a single Lua VM,

01:41.600 --> 01:50.320
and they then communicate with each other. LuaJIT is also fast, and we have quite a high update

01:50.320 --> 01:59.520
rate, so some systems in the game run at up to two kilohertz. And this is the problem.

01:59.840 --> 02:12.080
Okay, this might be bad. Okay, this is the problem. So JSON, you know, it's a notation,

02:12.080 --> 02:18.640
or a format, for exchanging data, a text format. And Lua is a programming language,

02:18.640 --> 02:27.040
and this programming language has basically only one complex data type, the table, which is both

02:27.280 --> 02:37.760
an associative array and also a sequential array, so it has these keys and values.

02:39.520 --> 02:47.040
And, yeah, we are not interested only in the pure speed of the JSON

02:47.040 --> 02:55.200
parsing, but also in the creation of the Lua objects, because in the end, both of these parts

02:55.760 --> 03:04.320
need to be fast enough for the use case, because we need to work with these objects, not just

03:04.320 --> 03:14.640
parse the representation into an intermediate structure. So of course, there are many JSON

03:14.640 --> 03:23.040
libraries available in C++, and even in Lua there are a lot of libraries available. Or you can

03:23.120 --> 03:33.760
combine the two, because Lua has a C API; some people have made such bindings publicly available, or you can

03:33.760 --> 03:42.160
also do it on your own: you can take any C++ library and then use some sort of Lua bindings

03:42.160 --> 03:52.000
in C or C++. They all exist, of course, but let's go through them one by one. So, these pure Lua libraries

03:52.640 --> 03:59.120
which are available on the internet, you can integrate them into your codebase easily,

04:00.080 --> 04:09.680
but at least the ones that I tested are just not performant enough, because Lua is still a dynamic

04:09.680 --> 04:16.960
scripting language, and these implementations are frequently not focused on performance.

04:17.280 --> 04:27.520
So let's go to the C++ libraries, and here I think the state of the art is simdjson,

04:27.520 --> 04:36.800
which is very fast based on their own benchmarks. They say they can parse JSON at gigabytes per second.

04:38.080 --> 04:45.840
Of course, for this, you need the extra bindings, which might make the whole thing slower.

04:47.280 --> 04:54.560
And the biggest problem for our use case: of course, you can use simdjson for JSON parsing,

04:54.560 --> 05:02.640
but our JSON format is actually not strict JSON, and we have some extra features in there,

05:03.680 --> 05:13.200
which I will talk about later. And then you have the middle ground, these bindings:

05:13.280 --> 05:21.280
they suffer from the same problems as the pure C++ libraries. That means, for example,

05:21.280 --> 05:27.840
if you extend lua-simdjson, you still have the problem that it's a strict parser.

05:28.880 --> 05:34.480
And some of them, for example also lua-simdjson, which is publicly available,

05:34.480 --> 05:40.800
use the Lua C API, and this might be a surprise, but using the Lua C API is

05:41.600 --> 05:46.320
not so performant, and it actually hurts the parsing a lot, as we will see.

05:48.000 --> 05:52.000
So it's time to reinvent the wheel and write a JSON parser again.

05:53.040 --> 06:00.400
And this is our format. It's called jbeam, but in the end it's just JSON with comments,

06:00.400 --> 06:08.480
or SJSON. You might see this file and it kind of looks weird, so we can go through all the

06:08.560 --> 06:14.240
differences. So you have comments, which don't do anything, but you have to parse them;

06:15.440 --> 06:21.280
you can have equal signs instead of colons, just for whatever sake;

06:22.160 --> 06:28.320
commas are fully optional, they are treated as whitespace; and also you don't need to quote your

06:28.400 --> 06:38.720
strings when they are the keys of dictionaries. That's basically it. Yes, so you can find

06:38.720 --> 06:45.680
this format elsewhere; it was used in multiple other engines first, but there is not one single

06:45.680 --> 06:54.160
exact specification for it. So we wrote our own parser, and first we wrote it in Lua,

06:54.240 --> 07:05.440
because that is easy. And it has already been used in the game for the past 10 years,

07:05.440 --> 07:10.320
so it's well tested. And it's small: you can write the parser in 300 lines of code.
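To make those differences concrete, here is a small, hypothetical jbeam-style snippet; the key names and values are invented for illustration, and only the syntax features follow what is described above:

```
// comments are allowed and have to be parsed
{
    name = "example"      // '=' can be used instead of ':'
    weight = 25           // keys don't need quotes
    nodes = [             /* commas are optional: treated as whitespace */
        ["id" "posX" "posY" "posZ"]
        ["n1"  0.0    0.5    0.3]
    ]
}
```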

07:12.160 --> 07:20.000
And it also parses JSON, because SJSON, or jbeam,

07:20.160 --> 07:30.720
is a superset of JSON. But of course, it has some caveats: it is not a validating parser,

07:30.720 --> 07:37.520
because, for example, it fully ignores commas, so it cannot tell you whether a JSON file

07:37.520 --> 07:43.120
follows the specification or not, and it also parses some files which are not strict JSON.

07:43.200 --> 07:50.320
And it was written to be JIT-friendly, as we use LuaJIT. So, yes, there is going to be

07:50.320 --> 08:02.400
quite a bit of source code now. So let's see the basics. This is the basic function, and,

08:04.240 --> 08:12.400
yeah, we start doing some tricks to be faster. So Lua is a garbage-collected

08:12.480 --> 08:19.600
language, and for the duration of the parsing, we disable garbage collection, because it's

08:19.600 --> 08:28.720
much faster when we are working with lots of new objects. So that is trick number one.

08:30.320 --> 08:39.760
And then, skipping whitespace is a big part of JSON parsing, because there can be quite

08:40.080 --> 08:47.120
a bit of useless whitespace in JSON (of course not if your JSON is minified), but in this case,

08:47.120 --> 08:55.120
we have semantic meaning in all these comments, so we want to keep them. And whitespace skipping

08:55.120 --> 09:03.280
is just going through all the characters, basically, and checking whether they are a space or

09:03.520 --> 09:11.120
whether they are a comma. As you can see, we use the ASCII values for this, because Lua

09:11.120 --> 09:18.560
just doesn't have a character data type, so you would have to use a one-character string, which is

09:18.560 --> 09:27.600
much slower; so, to go faster, use numbers. And the important takeaway: whitespace skipping can

09:27.680 --> 09:39.520
take a lot of time. We will see later that some optimizations are related to whitespace skipping,

09:40.480 --> 09:49.440
and the core of this JSON-parsing algorithm is this peek table, which only looks

09:49.440 --> 09:58.160
at the next character and decides what needs to be parsed. If you have written a parser before:

10:00.000 --> 10:08.000
JSON is really easy, you don't need to look far ahead, you just get to see the one single

10:08.000 --> 10:17.040
next character, and this is the full implementation of the parsing.

10:17.040 --> 10:24.240
If you see 'n' as the next character, it can only be null, or it's invalid JSON; the same for

10:24.240 --> 10:29.840
the boolean values, and 'i' especially, because you can have Infinity, which is a number,

10:30.880 --> 10:39.120
and then the strings and, for our use case, comments. So our peek table just looks again

10:39.120 --> 10:46.560
at the ASCII values, and based on that, for example, if it encounters the value of 't', it just

10:46.640 --> 10:57.520
checks whether the next three characters are 'rue', so true, and if not, then it fails. And the only

10:57.520 --> 11:08.560
more complex functions are the ones that parse objects or arrays. So you see that first

11:08.640 --> 11:16.560
we allocate a new table, then just read the key, which is a string in JSON,

11:16.560 --> 11:23.600
then skip whitespace, read the value, which can be anything (so here is the recursion), and

11:25.200 --> 11:34.400
that's basically it. But again, there are some optimizations done. For example, this

11:35.360 --> 11:45.600
table.new function: you specify the number of values that should be pre-allocated.

11:45.600 --> 11:52.160
LuaJIT tables are special in this way, in that they have the sequential array part

11:52.160 --> 11:58.880
and then the hash part, which is the associative array. And we specify,

11:58.960 --> 12:09.360
yeah, we specify three values, because that was the value selected by benchmarking, of course,

12:09.360 --> 12:18.480
and it's important to focus on this if you are optimizing. And another of the tricks

12:19.600 --> 12:25.760
is that we have this skip-whitespace function, I showed it to you, but it proved to be faster

12:25.760 --> 12:35.920
to copy the skip-whitespace implementation into the calling code, because LuaJIT, or Lua in general,

12:35.920 --> 12:46.000
has overhead for function calls. So inlining gives you some extra performance boost.

12:47.600 --> 12:54.160
Now let's focus on the benchmarking. I used the latest commit of LuaJIT, because it's

12:54.480 --> 12:59.600
a continuously developed project; this is my CPU, but my system doesn't matter too much.

13:00.640 --> 13:10.480
I made this benchmarking protocol on the go; yeah, this is kind of my first benchmarking

13:10.480 --> 13:16.720
project, so of course, you have probably seen talks before about how to measure performance

13:16.800 --> 13:24.240
reliably, and I tried to follow that. In the end, I got some of these things right. I run passes

13:24.240 --> 13:32.160
over some JSON data sets, and I run until either some number of these passes is reached or a

13:32.160 --> 13:43.520
timeout runs out. And I do some things so that the runs do not depend too much

13:43.680 --> 13:50.480
on each other: I collect the garbage before I measure, and I also flush the JIT cache,

13:50.480 --> 13:58.400
which means the JIT has a clean state, so in theory it shouldn't be influenced by the fact

13:58.400 --> 14:05.040
that I already parsed this JSON file one thousand times before, because otherwise we would

14:05.040 --> 14:11.200
just have everything in the JIT cache. I take the mean time of the passes, and I repeat everything;

14:11.200 --> 14:18.720
so I have these outer runs: I repeat everything five times, take the mean of

14:18.720 --> 14:26.400
the means, and calculate throughput out of that. I was also looking at the minimum and maximum,

14:26.400 --> 14:32.880
but I didn't include them in these numbers (too many numbers). And to further reduce the variance

14:32.880 --> 14:43.920
on my machine, yes, I set the process priority to the maximum, and it also helped a bit to

14:43.920 --> 14:53.760
always force the same CPU core, the same CPU affinity, using taskset. So let's talk a bit about

14:53.760 --> 14:59.920
the data sets I used. The first one is the jbeam data set, so these are the files that we actually

14:59.920 --> 15:10.640
wanted to parse. These are the SJSON files, which are mostly numbers, some positions, and short strings,

15:11.680 --> 15:23.040
and they represent the physical structure of these vehicles. So they tell

15:23.120 --> 15:31.840
which of these nodes, which have IDs, are connected to each other, and the strength of these

15:31.840 --> 15:41.200
connections, for the purpose of the simulation. And I did the benchmark, and our parser is the fastest,

15:41.200 --> 15:48.480
but only because the other parsers do not parse SJSON, because there are comments, so this does not tell

15:48.560 --> 15:58.640
us much. Therefore, I made the JSON data set, which contains some testing files that

15:58.640 --> 16:07.280
simdjson is also using. So, 20 files with different sizes, different contents, different structures,

16:08.800 --> 16:17.280
15 megabytes in total. And I ran it again, and our parser is not the fastest anymore.

16:18.000 --> 16:28.400
You see, with the JIT it still reaches reasonable performance, but lua-simdjson is faster. We don't

16:28.400 --> 16:41.280
get the gigabytes per second they were advertising, but in the end it's okay. Still, we were not happy

16:41.280 --> 16:47.520
with it. Also, when you look at the numbers with the JIT off, which is a realistic situation

16:47.520 --> 16:53.120
(there are some platforms where the JIT is not available), the performance is really bad.

16:54.720 --> 17:03.600
And yeah, this is just an extra one-file data set. I wanted to see the difference between those

17:03.600 --> 17:09.600
20 files and one file where I already understand what's inside. These are just

17:10.240 --> 17:18.240
Twitter statuses; it's really commonly used, and the numbers are quite similar, or at least the

17:20.960 --> 17:31.120
relative positions of these parsers are. So, we have this Lua parser, which beats the

17:31.120 --> 17:38.000
other pure Lua parsers, but simdjson is very fast, and the performance of the Lua parser also

17:38.000 --> 17:46.800
relies on the JIT. And this is the point of the talk: can we go closer to the source code and

17:47.680 --> 17:56.080
get a faster implementation? Just a bit about the LuaJIT source code: it's not large, it's

17:56.160 --> 18:06.400
around 70 C files and some handwritten assembly, but in the end, if we just want to

18:06.400 --> 18:12.960
hack on it a bit, we are interested in two files: the one which does serialization of an

18:12.960 --> 18:24.560
internal LuaJIT format, and string buffers. So, string buffers are, let's say, a module in

18:24.560 --> 18:32.960
LuaJIT which allows you to have mutable strings, because normal Lua strings are not mutable

18:32.960 --> 18:43.520
and are also interned. And this is the serialization function that LuaJIT already has. So, you

18:43.520 --> 18:51.520
encode a table, and you see this is the same as encoding a table, for example, to JSON, but it

18:51.600 --> 18:59.120
goes to a different format, and you can also take this format and decode it back. But we can

18:59.120 --> 19:07.840
steal the source code of this and basically reimplement it to generate JSON and decode JSON instead

19:07.840 --> 19:14.960
of this format. So, it's almost the same: you see the same primitive skipping whitespace, and again,

19:14.960 --> 19:23.440
peeking at the next value is the same; here we go, character by character.
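The one-character peek dispatch described above can be sketched in plain C; this is an illustrative stand-in with invented names, not the actual code from the talk:

```c
#include <assert.h>
#include <string.h>

/* Classify a JSON(-with-comments) value by peeking at its first byte only.
   The enum and function names are hypothetical. */
typedef enum { V_NULL, V_TRUE, V_FALSE, V_NUMBER, V_STRING,
               V_OBJECT, V_ARRAY, V_COMMENT, V_INVALID } ValueKind;

static ValueKind peek_value(const char *p)
{
    switch ((unsigned char)*p) {
    case 'n': return strncmp(p, "null", 4) == 0 ? V_NULL : V_INVALID;
    case 't': return strncmp(p, "true", 4) == 0 ? V_TRUE : V_INVALID;
    case 'f': return strncmp(p, "false", 5) == 0 ? V_FALSE : V_INVALID;
    case '"': return V_STRING;
    case '{': return V_OBJECT;
    case '[': return V_ARRAY;
    case '/': return V_COMMENT;      /* start of a // or C-style comment */
    case '-': case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return V_NUMBER;             /* Infinity ('i') etc. omitted here */
    default:  return V_INVALID;
    }
}
```

Each case either resolves the value immediately or hands off to a dedicated sub-parser; no lookahead beyond one byte is needed.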

19:25.120 --> 19:34.960
So, here are some simple things; I will start skipping a bit. Okay, and here it is really similar

19:34.960 --> 19:43.120
to the Lua implementation. Again, you need to pre-allocate some memory, with the counterpart of the table.new function;

19:43.200 --> 19:51.680
it does the same thing, except this number in the C function is one more, because the

19:51.680 --> 20:01.120
C API indexes from zero and the normal one from one. And then it writes the Lua values.

20:01.920 --> 20:08.720
So, we did this naive implementation, and it's much better when the JIT is off,

20:08.720 --> 20:15.920
since the performance doesn't depend on the JIT anymore, but with the JIT on we didn't get a

20:15.920 --> 20:24.320
speed-up; it's even worse. So, what optimizations can we do? Branch predictor hints are

20:24.320 --> 20:33.040
already a part of LuaJIT, and we can basically give a hint to the compiler that this

20:34.000 --> 20:42.240
shouldn't happen, so you should expect that the likely part

20:42.240 --> 20:49.200
is happening. So, for example, here, every time we encounter an error path, like invalid

20:49.200 --> 20:55.840
JSON or the end of the file, we don't need to care whether it's slow or fast.
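The hint mechanism can be illustrated with the common GCC/Clang `__builtin_expect` pattern; LuaJIT ships its own `LJ_LIKELY`/`LJ_UNLIKELY` macros, and the function below is an invented example, not code from the talk:

```c
#include <assert.h>
#include <stddef.h>

/* Portable likely/unlikely macros: on GCC/Clang they feed the branch
   predictor via __builtin_expect, elsewhere they are no-ops. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Count leading digits; running off the end of the buffer is the rare,
   cold path, so it is marked unlikely. */
static size_t count_digits(const char *p, const char *end)
{
    size_t n = 0;
    for (;;) {
        if (UNLIKELY(p == end))
            break;                       /* end of input: cold path */
        if (!LIKELY(*p >= '0' && *p <= '9'))
            break;                       /* non-digit ends the run */
        n++;
        p++;
    }
    return n;
}
```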

20:55.920 --> 21:05.920
So, it's unlikely to happen for us. Then, the initial number parsing: I used the

21:05.920 --> 21:15.440
LuaJIT function, which deals with all sorts of formats, but I did some profiling, and it showed

21:15.440 --> 21:23.600
that it's not super performant. So, I simplified the code; I really wrote the

21:23.600 --> 21:31.680
naive implementation, where it just parses the digits until you encounter the

21:31.680 --> 21:40.800
decimal dot, and then you parse the fraction value; it was much faster. For some of these

21:40.960 --> 21:48.320
scientific-notation numbers, I still use the original function, because I'm lazy, but these numbers

21:48.320 --> 21:56.480
are not common in our use case. Then, inlining: we saw it in the Lua implementation, but here, Lua-

21:56.480 --> 22:04.240
JIT already contains the macros, so we can use them. And I went really hard with the

22:04.240 --> 22:12.080
inlining, because I can. I also wanted to have the functions in a real module, but in the

22:12.080 --> 22:18.560
end, in the compiled code, I want everything to be inlined, so most of the functions in the parser

22:18.560 --> 22:27.280
are inlined. And then, yeah, don't forget to use a profiler; I saw that a lot of the time

22:27.280 --> 22:35.680
is spent resizing the tables, the Lua tables, and this happens because, when I'm

22:35.680 --> 22:44.240
parsing, I set the initial size of the table and then I parse the children, but there can be,

22:44.240 --> 22:51.200
for example, 5,000 children, and I only allocate space for three elements, so this means

22:51.280 --> 22:58.400
I will have to grow from three, probably doubling the size a lot of times, and it's

22:58.400 --> 23:12.400
going to copy a lot of values. So, there is this scratch buffer, which is just

23:12.400 --> 23:18.320
a dynamic stack for these temporary values. It's important that, yeah, it is there for all

23:18.400 --> 23:27.280
the recursion levels, because we parse up to 100 levels of recursion depth, and you still, of course,

23:27.280 --> 23:34.320
need to copy the values from the buffer, but you don't reallocate anymore, so it helps.
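Why pre-sizing (or a reusable scratch buffer) matters can be shown with a toy growable array; this is an illustrative sketch, not LuaJIT's table code, and it counts how many element copies doubling growth causes:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A toy dynamic array that starts at capacity 3 and doubles when full,
   tracking how many elements get copied by the regrowth. */
typedef struct { int *data; size_t len, cap, copied; } Vec;

static void vec_push(Vec *v, int x)
{
    if (v->len == v->cap) {                /* grow: 3, 6, 12, 24, ... */
        size_t ncap = v->cap ? v->cap * 2 : 3;
        int *nd = malloc(ncap * sizeof *nd);
        if (v->len)                        /* every grow copies len ints */
            memcpy(nd, v->data, v->len * sizeof *nd);
        v->copied += v->len;
        free(v->data);
        v->data = nd;
        v->cap = ncap;
    }
    v->data[v->len++] = x;
}
```

Pushing 5,000 children into a structure pre-sized for three forces a dozen regrows that together copy thousands of elements, while filling a buffer that is already large enough copies nothing until the final move into the table.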

23:35.360 --> 23:41.920
And then, yeah, this is the cherry on top: for whitespace skipping and other

23:41.920 --> 23:52.560
string scanning, we use some intrinsics. There are two functions where it proved to be useful;

23:52.560 --> 24:01.440
there might be more, but again, I'm lazy. So, yeah, for whitespace skipping: when you are parsing comments,

24:01.440 --> 24:06.240
you just need to find either the end of the line, or for multi-line comments, you need to find the

24:06.320 --> 24:14.560
next star, the asterisk; and for strings, you either need to find the matching quote, or there

24:14.560 --> 24:23.200
can be escape characters, and then you go to the slow path. So, yeah, I'll just show the SSE4.2

24:23.200 --> 24:31.520
version, which basically sets up a search set of the values I want to find,

24:31.600 --> 24:38.720
and the improvement is that I scan through 16 values at once: this primitive gives me the

24:38.720 --> 24:44.480
index of the first match, or if I don't find a match, then I go on to the next 16 characters;

24:44.480 --> 24:51.600
it operates on 16 bytes at once, so it provides some speed-up. The same benchmark again,

24:52.560 --> 25:03.600
and our parser is faster. Now we got, yeah, a 33% improvement. But

25:03.600 --> 25:10.400
which one of these optimizations was helpful, you might ask, and which was not? So I did an

25:10.400 --> 25:18.240
ablation study: the final version is the one that is published, and I disabled

25:18.640 --> 25:26.320
these optimizations one by one, and basically you can see that probably all of them are in

25:26.320 --> 25:35.280
some way helpful, and if you go to the SSE4.2 intrinsics, you might even get some small boost,

25:35.280 --> 25:41.520
but the default version uses SSE2, because those instructions are available on every x86-64 CPU.
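The idea of the SIMD scan can be sketched with plain SSE2 compare-and-movemask (the talk's SSE4.2 variant uses a search-set string instruction instead); this is an illustrative function with an invented name, and it falls back to a scalar loop where SSE2 is unavailable:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Return the index of the first '"' or '\\' in p[0..n), or n if none,
   scanning 16 bytes per iteration when SSE2 is available. */
static size_t scan_quote_or_escape(const char *p, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i q = _mm_set1_epi8('"');
    const __m128i b = _mm_set1_epi8('\\');
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        int mask = _mm_movemask_epi8(_mm_or_si128(_mm_cmpeq_epi8(v, q),
                                                  _mm_cmpeq_epi8(v, b)));
        if (mask)   /* lowest set bit marks the first matching byte */
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
#endif
    for (; i < n; i++)   /* scalar tail (and fallback on non-SSE2 targets) */
        if (p[i] == '"' || p[i] == '\\')
            return i;
    return n;
}
```

The scalar tail also handles the last sub-16-byte chunk, so the fast path never reads past the end of the buffer.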

25:41.520 --> 25:52.800
And on the JSON parsing data set, this is important: we beat lua-simdjson, so out of the

25:52.800 --> 25:59.360
parsers that I knew at the time of making this, and still, this should be the fastest

25:59.440 --> 26:14.800
LuaJIT parser; and on twitter.json it's the same. So, we gained some speed-up, a 50% speed-up, when the JIT is on,

26:14.800 --> 26:21.440
but when the JIT is off, which might be really important for some platforms, we get a more than 10 times

26:21.520 --> 26:30.400
speed-up; so parsing these files can take, yeah, instead of three seconds,

26:31.120 --> 26:38.560
0.3 seconds. And on the JSON data set, we are faster, and much faster than the

26:38.560 --> 26:45.760
other implementations. So, some tips at the end: try to combine these tricks, but you have to

26:45.840 --> 26:54.320
measure them; use the profiler; minimize copying; and, yeah, focus on your files. If you are

26:54.320 --> 27:01.680
writing a parser yourself and your files are not using scientific-notation numbers or escape characters,

27:01.680 --> 27:09.520
then you can leave those parts of the implementation slow, and you will save some time. So, there are also limitations:

27:09.520 --> 27:19.120
yeah, the only thing which we really care about is the JSON parsing of a file for the

27:19.120 --> 27:26.240
first time, but it's very hard to measure that cold parsing time. So, in the end, we just take

27:26.240 --> 27:33.920
this proxy measurement; of course, it's a big limitation. And, yeah, I had problems with some

27:33.920 --> 27:38.640
unstable performance numbers; in the end, lots of these tricks helped, but there are still

27:38.720 --> 27:44.320
problems. If you run the benchmark yourself, you will probably get slightly different values,

27:45.520 --> 27:54.400
even if you had the same CPU and system as me. So, I think that is it. Are we finished? The

27:54.400 --> 28:00.240
slides would say no, but I say yes, because we still have JSON encoding, but it's really simple.

28:01.040 --> 28:08.080
Our encoder happens to also be fast, but I never put any thought into optimizing it. So,

28:08.160 --> 28:14.640
thank you for bearing with me. Yeah, and if you have questions, then...

28:18.960 --> 28:21.200
Now it's time for a question, if you have one.

28:28.320 --> 28:35.600
Just a small question. So, for the assembly part, did you rely on compiler intrinsics, or did you use

28:35.600 --> 28:44.080
loop auto-vectorization stuff? Oh, yes, I relied entirely on intrinsics; yeah, for example,

28:44.080 --> 28:49.760
in the SSE4 version I'm using the intrinsics here, no special compiler options or anything.

28:53.120 --> 28:59.280
I'm not sure if it's relevant, but did you consider, did you have cases where you had

28:59.360 --> 29:06.080
chunked parsing, where you had to parse one part, and then maybe wait on some data, and then

29:06.080 --> 29:15.440
had to parse again? So, stream processing? Yeah, okay. In this case, no:

29:15.440 --> 29:21.280
you already need to have all the contents in the buffer. It would be an interesting optimization

29:21.360 --> 29:29.280
in some cases, yeah, it is something to look into, but no, this works only when everything

29:29.280 --> 29:30.560
is already in memory.

29:43.520 --> 29:50.640
Have you considered totally ditching the JSON part, and just writing your data in a Lua file

29:50.960 --> 29:58.720
and reading that? Of course, there might be a more optimal format.

29:58.720 --> 30:04.240
The problem is that these JSON files have been there since the beginning; the code base is

30:04.240 --> 30:09.200
more than 10 years old. So, in the end, the format is not going away.

30:09.200 --> 30:14.400
We might introduce something like jbeam version 2, which might be a more performant data format,

30:15.360 --> 30:22.400
but from a practical perspective, it would be hard to transition

30:22.400 --> 30:25.680
the format. So, we're not going to break the JSON.

30:31.520 --> 30:40.080
Are there more questions? So, it looks like we are done. Thank you again. Okay, thank you.

