WEBVTT

00:00.000 --> 00:11.760
Okay, cool. So, hi, everyone. I'm Tibu. It's an honor to be here for the third year in

00:11.760 --> 00:17.920
Phas Dam. And this is going to be a technical talk on sharing interest in tips on codec

00:17.920 --> 00:25.360
development. So, the basis of this work is my work that has been going on for 12 years already

00:25.440 --> 00:30.800
on Edge 264, which I will call Edge here. It's an AVC video decoder with a permissive license.

00:32.160 --> 00:38.400
And to me, it has been research on improving performance and size of the decoder by trying

00:38.400 --> 00:44.880
unusual tricks. And thanks to that, it has been come about 15% faster and 4 times lighter than the

00:44.880 --> 00:50.640
state of the art, which is a good achievement, I think. During the past year, I worked towards

00:50.640 --> 00:57.600
release with new things in bold. And the major thing is Netflix joined to support this work. So,

00:57.600 --> 01:03.360
huge thanks to them. It has been a huge boost in motivation. And now I'm steadily steadily

01:03.360 --> 01:11.040
progressing towards release. Without further introduction, let's go over a quick overview of AVC.

01:11.040 --> 01:22.720
So, AVC is the 20 years old codec. And the basic structure is 6 code modules and your data comes

01:22.720 --> 01:31.040
as an coded bit stream goes out as frames. And we're going to focus on the the two first blocks

01:31.040 --> 01:37.600
that is headers and slices parsing the parsing side of the codec. And I'm going to present two

01:37.600 --> 01:43.840
main things, the use of YAML as a format for logging output and the optimization of the

01:43.840 --> 01:52.480
Quebec arithmetic and coder. This is going to be technical. Okay, first YAML, YAML, come on please.

01:52.480 --> 02:00.720
YAML is a data serialization format that is a standard way to convert a language object

02:00.720 --> 02:06.960
in your program in your program into text and back. And in retrospect, that has been one of my best

02:06.960 --> 02:12.880
design choices in the past few years. And I will describe here the choice and what it enabled

02:12.880 --> 02:25.600
in my code base. Okay, now to the meet. Edge in edge I log everything I parse. So, everything

02:25.600 --> 02:33.120
every symbol that I get from the bit stream I log it. And since AVC or H364 is a sequence of units,

02:34.080 --> 02:39.920
my log is going to be an array of objects of YAML objects. And here is an example of the log

02:39.920 --> 02:48.560
output of a sequence parameter set. With YAML, I get these, I get readability, thanks to

02:50.400 --> 02:56.240
having a slim syntax, thanks to syntax coloring in the editor, and thanks to the availability of

02:56.240 --> 03:02.400
comments, which for example Jason doesn't have. So, I can add extra information like the fact

03:02.400 --> 03:08.640
that seven corresponds to a sequence parameter set or that a line was actually inferred, not present

03:08.640 --> 03:17.200
in the bit stream, but inferred by other sequences. I get also compactness by the use of a recursive

03:17.200 --> 03:23.360
format. If you compare that to CSV, that would make many more lines. So, here I can actually

03:23.920 --> 03:30.560
put more symbols into a single line and thanks to the availability of inline structures that YAML

03:30.560 --> 03:37.520
allows. And I also get scriptability by design that is the ability to load your YAML directly

03:37.520 --> 03:46.800
into a programming language without having to parse it before. Okay, with that. So, once you have

03:47.120 --> 03:53.040
the ability to easily feed the log to a program, the next step is re-encoding.

03:54.000 --> 04:01.680
That is converting our YAML back to the original video. This is a good way to check that our

04:01.680 --> 04:08.400
logging is complete. That is that the log represents the whole video and that you're not missing any output.

04:10.400 --> 04:15.840
And I'm going to share three key tips that I think will help you if you want to go the same route.

04:16.960 --> 04:23.360
The first is to use Python or a high level English and I'm serious with that. Python is

04:23.360 --> 04:29.280
excellent for prototyping algorithm and I have been genuinely surprised that it hasn't been used

04:29.280 --> 04:36.320
more to design encoders, especially in initial encoders when you care more about compression than

04:36.400 --> 04:48.640
performance. The second tip is to output context three units. That is, units usually depend on

04:48.640 --> 04:53.280
the previous symbols that you get in other units. Here, you include everything into a single unit

04:53.280 --> 05:01.360
so that the re-encoding becomes easier. For example, here, instead of opening just a frame number,

05:01.440 --> 05:05.600
I output the frame number and the number of bits that were used to encode the data,

05:05.600 --> 05:11.280
or to decode the data so that when re-encoding, I just need to output for bits and not look at

05:11.280 --> 05:18.320
previous units to know how many bits I used to encode this. At this point, if I compare

05:18.960 --> 05:27.440
this script gen ABC with a library that does about the same, I converted from 9,000 lines of code

05:27.440 --> 05:35.920
in C to 500 lines of code in Python. That's almost a 20 times reduction. So pretty good and especially

05:35.920 --> 05:41.120
it's better for maintenance because this script is just a side thing. I won't go back to that.

05:43.280 --> 05:50.480
And the last key tip is that Python has actually built in support for bit buffers.

05:50.480 --> 05:56.080
Integer in Python are infinite so basically you can use them as bit buffers by using a single

05:56.800 --> 06:03.040
set bit and pushing the set bit when you insert new values into the buffer.

06:03.920 --> 06:08.640
I'm putting these operations, these five key operations as reference if you need to use them in

06:08.640 --> 06:12.800
your projects and you may come back to the PDF later of the presentation if you want.

06:16.160 --> 06:22.240
Okay, next step. Once we have the ability to encode, to re-encode a YAML back to a video,

06:22.320 --> 06:29.600
we can create custom test bit strings to stress the decoder. My goal here is to catch up after

06:29.600 --> 06:37.040
FFMPEG and OpenH364 over decades of field testing and field availability. Edge is quite recent

06:37.040 --> 06:43.760
and I don't have the same feedback from a field. And my key tip here has been to search for

06:43.760 --> 06:50.080
shell clauses in the spec that like a thousand or so. For example here, a sequence parameter

06:50.160 --> 06:55.280
set with that particular value of sequence parameter set ID should be shall be available to the

06:55.280 --> 07:00.480
decoding process prior to its activation. That means that whenever you get a sequence,

07:01.360 --> 07:06.480
whenever you reference a sequence parameter set, it has to exist before you reference it.

07:07.520 --> 07:13.520
So for testing, it means that I should create a bit string that contains a unit referencing a unit

07:13.520 --> 07:20.080
that does not exist. Okay, that's a tricky thing. But basically I read the spec, I read the

07:20.080 --> 07:25.360
shell clauses and I create, I imagine bit strings that should fail. And then I convert that into a

07:25.360 --> 07:32.000
YAML and into a test bit string. The good thing with these custom tests is that they can be

07:32.000 --> 07:37.200
included into public repositories that my own repository since I created them. So I own them.

07:37.920 --> 07:41.760
And also since these bit strings are much smaller than conformance bit strings,

07:41.840 --> 07:46.320
they are much faster to run as tests. So the automated tests are much faster and I don't need

07:46.320 --> 07:49.360
to include the conformance bit strings into the public repository.

07:53.680 --> 08:00.800
The last key point is data analysis. Thanks to YAML, I can write scripts in Python again

08:00.800 --> 08:07.600
to generate visualizations. And this is an ongoing work to design multi-threaded scheduler.

08:08.320 --> 08:12.960
The code samples are available in the repository to reference receipt. I'm not going to go with them.

08:14.240 --> 08:22.160
Okay, midway. Now Quebec. First question, who among you is familiar or has heard about Quebec?

08:23.520 --> 08:34.000
I guess not being able. Okay. Okay. Quebec is a layer of compression on H264 ADC that takes the input

08:34.000 --> 08:41.440
bit stream and a context and yields bits one by one. You ask it. I want one bit of a motion vector

08:41.440 --> 08:47.120
or I want one bit of an IDCT coefficient and it gives you this bit. It's a very low-level

08:47.120 --> 08:53.840
compression thing. This is a screenshot of edge showing how bad it is on performance. It's about

08:53.840 --> 08:59.120
half of the performance goes just to Quebec. It's almost impossible to vectorize the serial process

08:59.760 --> 09:06.640
love it. And it has been reused in HVC and VVC. So everything that I'm presenting here will

09:06.640 --> 09:13.520
matter to these phones too. Okay, let's first describe a theoretical arithmetic decoder.

09:19.760 --> 09:26.320
A theoretical arithmetic decoder has one state that is a floating point, the offset. It's a

09:26.320 --> 09:31.760
floating point with infinite precision that is loaded from the bit stream. And for each request

09:31.760 --> 09:37.920
of one bit, you would fetch a probability threshold that is given by the symbol you want to decode.

09:39.760 --> 09:47.760
You will compare the offset with this threshold and basically rescale the slice that was selected

09:47.760 --> 09:54.560
into 0 and so basically scale the offset with the range so that it becomes 0 and again.

09:56.560 --> 10:01.840
So here we're in 0 than 1 than 1. So the bit style returned only going to be 0, 1 and 1.

10:02.480 --> 10:08.400
This is a really quick crash course on arithmetic decoding but I encourage you to look at Wikipedia

10:08.400 --> 10:16.080
if you want to learn more. Quebec does the same but with integers, 9 bit integers. The offset

10:16.080 --> 10:24.960
and the range are both on 9 bits. And the rescale happens only when the range has less than 9 bits.

10:24.960 --> 10:32.480
So the Quebec state is on two integers, the offset and the range. The working is the same

10:32.480 --> 10:37.920
when you get the 0 value, you get down but you don't scale up. You only scale up when you need

10:37.920 --> 10:45.360
to have enough precision in the range. Here at the last stage, one range becomes lower than

10:46.640 --> 10:53.840
lower than 9 bits so lower than 256. You scale it up so you multiply range and offset by 2

10:53.920 --> 11:00.800
and take a bit from the bit stream to the offset. That's how we actually reload into the offset.

11:02.160 --> 11:07.680
This is the original algorithm that is implemented in OpenH261 and that explains quite

11:07.680 --> 11:10.400
the lack of performance when you compare it with FFMB.

11:12.720 --> 11:20.080
Now the optimization. If instead you want it to load 10 bits from both offset and range.

11:20.160 --> 11:25.600
So you load 10 bits from offset and the range you shift it up by 1 bit. All the results

11:26.560 --> 11:31.040
would be the same. It would function the same but in the end at the third stage

11:31.920 --> 11:37.520
you would have enough precision in the range so you would not need to rescale. You would not need

11:37.520 --> 11:44.880
to rescale. So that space one renormalization trip to memory and the key tip is here. To do the same

11:44.880 --> 11:52.320
on size T integers or basically machine wide integers and taking here 64 bit machines you would

11:52.320 --> 12:00.640
initially load 64 bits into offset from memory from the bit stream and shift the range by 55 bits.

12:00.640 --> 12:09.440
The extra bits, the extra bits. Then you do your operations and at each renormalization when

12:10.320 --> 12:17.840
range and offset become lower than 9 bits you would refill both with 56 bits or 7 bytes.

12:20.720 --> 12:27.920
This has a advantage that it trades less frequent renormalizations with a new counting of extra bits

12:29.120 --> 12:35.920
and it goes well with extra loads wide loads that I presented last year. Now the wide loads go

12:35.920 --> 12:42.080
well with unifying undefined escaping that I presented last year. For those who followed. The second

12:42.080 --> 12:47.280
trick here that ffmpec doesn't have and I'm really proud of that is to mention is to observe

12:47.280 --> 12:55.920
that range belongs to 256, 511. That means that range is 9th bit is set. So you can use this bit

12:56.480 --> 13:03.120
to count the extra bits and with this it spares the use of an extra variable or an extra

13:03.120 --> 13:13.920
set bit. Second one, second one is awesome but even harder to understand. Okay bypass. In

13:13.920 --> 13:22.080
Quebec bypass is the mode for symbols that Quebec cannot compress basically for which 0 and 1 are

13:22.080 --> 13:28.960
equi probable. In ABC there are always used in loops of many bypass bits because they are used

13:29.040 --> 13:34.480
exclusively for idc t coefficients for long idc t coefficients and for long motion vectors. So

13:34.480 --> 13:39.840
there are both long and very rare. Here we are going to assume that the offset and the range

13:39.840 --> 13:47.840
have enough bits so that we normally to renormalize in between. What we would do is we compare

13:47.840 --> 13:54.400
offset with half the range then output a value. Then we compare offset with one fourth the range

13:54.400 --> 14:00.000
then we are output a value and then we compare offset with one eighth the other range and the

14:00.000 --> 14:08.960
bit sequence that you get is 001. If just to try out we map out all possible values of the offset

14:10.240 --> 14:15.360
the offset that the offset could take. We notice that all bypass results would follow

14:15.360 --> 14:23.200
progression from 0 to 7 left to right. That means to get the same results as many

14:23.200 --> 14:29.520
calls of bypass you could actually just divide the offset by one eighth of the range and doing

14:29.520 --> 14:36.800
this division would yield all three bypass bits instead of having to go through three calls of

14:36.800 --> 14:46.160
get bypass. And the trick is here. In Quebec to get end bypass bits you first ensure that the

14:46.160 --> 14:52.240
offset and the range have both at a more than n extra bits. Then you would shift down the range

14:52.240 --> 14:58.320
by the number of bits you want. You divide the offset by the range. You get n bypass bits and the

14:58.320 --> 15:04.880
remainder of the division is going to be the new offset. Then you feel to return n consumed bits.

15:04.880 --> 15:10.080
For example if the bypass bits were used in a VLC code you would multiply back the end

15:10.080 --> 15:14.320
consumed bit by range at then to the offset and then shift range up by m bits.

15:14.400 --> 15:25.360
OK this trick has no impact in edge. It's just really cool actually because it's really

15:25.360 --> 15:30.640
used in a Vc. The performance is going to be a bit worse for tiny batches because Dave has a

15:30.640 --> 15:37.920
high cost but this technique has a constant cost so it's really good for very complex streams.

15:38.560 --> 15:45.440
However this stream should be way more useful for HVC and VVC that use bypass bits a way more.

15:46.240 --> 15:50.960
I'm on time. Thank you for attention. Now I'm going to be a bit faster working on this

15:50.960 --> 15:55.760
thanks to Netflix and my ETA for releasing about two years I think. Thank you for attention.

