WEBVTT

00:00.000 --> 00:13.840
So the next talk is about making the

00:13.840 --> 00:16.360
CMRX kernel and so on.

00:16.360 --> 00:18.360
So, thanks a lot.

00:18.360 --> 00:20.360
Hello.

00:20.360 --> 00:22.760
Thank you for the introduction.

00:22.760 --> 00:33.320
Today I will be talking about what happens if you decide that working with your data structures

00:33.320 --> 00:40.440
in a convenient way is simply too boring.

00:40.440 --> 00:48.480
The agenda of this talk is: first, I will talk about what was actually the trigger

00:48.480 --> 00:50.920
to study all of this.

00:50.920 --> 00:59.600
And then we will briefly talk about the land of the lock-free, because all of this is

00:59.600 --> 01:03.280
actually inspired by the land of the lock-free.

01:03.280 --> 01:05.360
And then I will outline the solution.

01:05.360 --> 01:14.040
If time permits, I will also mention news in the CMRX kernel since

01:14.040 --> 01:16.120
the last FOSDEM.

01:16.120 --> 01:17.120
Okay.

01:17.120 --> 01:18.920
So let's start.

01:18.920 --> 01:23.880
What was the problem which made me think about this?

01:23.880 --> 01:32.120
The CMRX kernel does its best to make life inside the interrupt service handler as hard

01:32.120 --> 01:33.120
as possible.

01:33.120 --> 01:39.480
The reason for this is that we run on very limited hardware where there is a rule that interrupt

01:39.480 --> 01:44.440
service handler has to run in kernel mode or in supervisor mode.

01:44.440 --> 01:50.520
So it essentially has kernel privileges and it can do a lot of bad stuff.

01:50.520 --> 01:56.840
So we only allow two calls into the kernel, so you are very motivated to defer your work

01:56.840 --> 02:07.720
into user space. But even so, with these two calls you can still modify pretty much

02:07.720 --> 02:11.240
every data structure that is there.

02:11.240 --> 02:21.320
And when it is done from interrupt context, you have to somehow guard against

02:21.320 --> 02:29.000
concurrent modification so that things remain consistent.

02:29.000 --> 02:33.920
You can't use mutexes, because if you lock a mutex and you get an interrupt, the handler can then deadlock on that

02:33.920 --> 02:34.920
lock.

02:34.920 --> 02:38.480
What kernels do is they disable interrupts.
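The interrupt-masking approach described here can be sketched as follows; `irq_save`/`irq_restore` are hypothetical stand-ins for a port's real primitives, and the flag merely models the interrupt-enable state:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical port-layer primitives -- real kernels use CPU instructions. */
static bool irq_enabled = true;
static uint32_t irq_save(void)          { uint32_t s = irq_enabled; irq_enabled = false; return s; }
static void     irq_restore(uint32_t s) { irq_enabled = s; }

static volatile int shared_counter;

/* The "boring" approach: mask interrupts around every shared-data access. */
int counter_increment(void)
{
    uint32_t state = irq_save();   /* nothing can preempt us now */
    int v = ++shared_counter;      /* any update to shared data goes here */
    irq_restore(state);            /* latency and jitter = time spent masked */
    return v;
}
```

The longer and more variable the masked region is, the more latency and jitter it adds, which is exactly the complaint here.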

02:39.440 --> 02:47.360
As I said, it's safe, it's good, but it's boring and it introduces a lot of latency and

02:47.360 --> 02:56.040
also jitter, if the thing you do with interrupts disabled varies in time.

02:56.040 --> 03:05.160
So I've been thinking how to avoid this and I did some investigation.

03:05.400 --> 03:14.680
I investigated how other kernels do it, and I realized that many RTOSes simply disable

03:14.680 --> 03:17.480
interrupts and call it a day.

03:17.480 --> 03:24.920
Then I have a couple of friends and we have board game evenings with them and at one of

03:24.920 --> 03:33.160
these board game evenings, I mentioned this problem, the constraints and what I want to achieve

03:33.160 --> 03:40.680
and one of them told me, you know, this sounds a lot like you need a lock-free approach.

03:40.680 --> 03:46.440
Okay, so I started studying lock-free algorithms.

03:46.440 --> 03:53.000
I already knew the very basics of lock-free, and I guess you do too; for example,

03:53.000 --> 04:02.840
the case of a single-producer, single-consumer queue: that thing can run without mutexes.

04:04.040 --> 04:10.120
And it's inherently thread-safe, because it's a lock-free data structure in this one specific case.
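A minimal single-producer, single-consumer ring buffer, as a sketch in C11 (not CMRX code), shows why this one case needs no mutex: the producer only writes `tail`, the consumer only writes `head`:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 8  /* power of two keeps the index math simple */

static int buf[QSIZE];
static atomic_size_t head, tail;   /* head: consumer-owned, tail: producer-owned */

/* Called only by the producer; returns false when the queue is full. */
bool spsc_push(int value)
{
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire);
    if (t - h == QSIZE)
        return false;                                   /* full */
    buf[t % QSIZE] = value;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}

/* Called only by the consumer; returns false when the queue is empty. */
bool spsc_pop(int *out)
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h == t)
        return false;                                   /* empty */
    *out = buf[h % QSIZE];
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}
```

Each index has exactly one writer, so acquire/release ordering is enough and no lock is ever taken.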

04:12.200 --> 04:13.880
Unfortunately, it cannot be used here.

04:14.840 --> 04:21.960
So I've been searching for some other solution, and it was right around the time when LLMs

04:21.960 --> 04:31.000
became very popular, so I asked GPT: okay, show me some approach which satisfies these conditions.

04:31.160 --> 04:34.920
And it came up with code very similar to this.

04:36.520 --> 04:44.840
And what this code essentially does is that it creates a local copy of shared data under mutex.

04:45.560 --> 04:52.920
That is the one critical section, basically. Then it does the modification on the local copy and then

04:52.920 --> 05:01.240
commits it into the shared data. Which satisfies all my criteria, except one.
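The copy-modify-commit pattern described can be sketched like this (illustrative types, pthreads for the mutex); note the full structure copy on both the read and the commit stage, which is where the overhead comes from:

```c
#include <pthread.h>

/* Illustrative shared state guarded by a mutex. */
struct table { int entries[16]; };
static struct table shared;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* 1. snapshot under the lock, 2. modify the private copy,
 * 3. commit the copy back under the lock. */
void update_entry(int idx, int value)
{
    struct table local;

    pthread_mutex_lock(&lock);
    local = shared;                 /* read stage: take a snapshot */
    pthread_mutex_unlock(&lock);

    local.entries[idx] = value;     /* modify stage: no lock held */

    pthread_mutex_lock(&lock);
    shared = local;                 /* commit stage: publish the copy */
    pthread_mutex_unlock(&lock);
}
```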

05:01.240 --> 05:09.560
It has a lot of overhead. And it also looks quite a lot like a database transaction.

05:11.000 --> 05:21.720
Because you have read, modify, and commit stages. Therefore, I decided I would use this concept

05:21.720 --> 05:31.240
in general, but make it lightweight. Now, a bit about lock-free programming and what it actually is.

05:32.280 --> 05:36.440
A lot of people think that if you have lock-free programming, it means that you are completely

05:37.400 --> 05:44.120
free of any locking and nothing in the system ever gets blocked. Well, that's not true.

05:45.080 --> 05:52.760
The definition of lock-free is that at least one execution context does not block.

05:53.480 --> 05:59.160
And here, in this specific case, when we are talking kernel versus interrupt,

06:00.200 --> 06:05.000
the context which you don't want to block is the interrupt service handler.

06:06.040 --> 06:11.480
So what we want to achieve here is essentially a fast lane for interrupts,

06:12.440 --> 06:22.200
which kind of allows us to cheat. If you find this guarantee that you only have one context,

06:22.200 --> 06:28.840
which has the fast lane, too boring, then for you there's wait-free programming, which actually

06:28.840 --> 06:37.080
does guarantee that nobody uses locks and nobody gets blocked. But that's a lot more complicated

06:37.720 --> 06:44.360
and you get a lot more overhead. Which is actually the reason why not everything is lock-free:

06:44.360 --> 06:52.120
because lock-free comes with a lot of baseline overhead, which only gets paid for

06:52.120 --> 06:58.120
if you do massive concurrency, where the overhead of locking is much higher than the baseline overhead

06:58.120 --> 07:05.080
of lock-free algorithms. Now that we know roughly what lock-free is,

07:06.040 --> 07:13.560
and we know that the baseline idea was something like a transaction, let's look at

07:15.400 --> 07:22.840
how the transacted part of the kernel looks compared to a standard critical section.

07:22.840 --> 07:30.200
The thing on the left is a critical section, modeled by mutexes. So at the beginning of

07:30.920 --> 07:37.400
the access to shared data, you lock a mutex, or internally it could be that

07:38.840 --> 07:46.520
you disable interrupts. Then you find what you want to modify, and then you modify it.

07:46.520 --> 07:54.040
And once you are done, you re-enable interrupts. What the transacted version does is that, at the

07:54.120 --> 08:01.720
beginning, you open the transaction. This is a read-only action. Then you find the entry

08:01.720 --> 08:11.160
which you want to modify. This is not an exclusive action. You can be raced by someone else,

08:11.160 --> 08:16.200
both while opening the transaction and while accessing the data. Then you try to commit this transaction,

08:16.760 --> 08:22.680
and you specify that you are going to modify the data.

08:23.320 --> 08:31.240
And this transaction may either pass or fail. If it passes, then you get exclusive access

08:31.960 --> 08:40.120
to this data, and you can modify it. If it fails, you shall not modify it because you will

08:40.120 --> 08:46.120
break the contract. And then, once you are done modifying it, you call the done function and

08:46.440 --> 08:53.240
signal that you are done modifying the data. So the difference is that the critical

08:53.240 --> 09:01.560
section runs exclusively the whole time. The transaction only runs exclusively for the time you

09:01.560 --> 09:09.720
are modifying the entry. It gets even more interesting if we consider the read-only case,

09:09.720 --> 09:23.080
because the one shown was the read-write case. So, if you read some non-atomic data from a shared data

09:23.080 --> 09:30.520
store in a concurrent system, you also use a critical section to get the data in a consistent state.

09:30.520 --> 09:42.920
With transactions, you perform the whole data access in a shared state, so you can be raced.

09:43.880 --> 09:48.680
And you can be raced even by someone who will modify the data. And the question is, okay,

09:50.440 --> 09:56.600
why would you do that, if you have a synchronization primitive which doesn't actually guarantee

09:56.920 --> 10:05.480
consistency of the data? Well, it doesn't guarantee you that nobody will modify the data

10:05.480 --> 10:14.920
structure, but at least if you survive it, it will tell you that it was modified. At the

10:14.920 --> 10:22.360
beginning, I mentioned that the transaction start is a read-only operation. And this read-only

10:22.360 --> 10:28.920
transaction has a nice feature: as a whole, it is a read-only operation, including the

10:28.920 --> 10:38.600
overhead of the transaction subsystem. And that means that readers don't block each other, readers don't

10:38.600 --> 10:46.200
block writers. And if you write something which someone else is trying to read in between,

10:46.280 --> 10:57.240
the readers will know that the data has been modified. And there's also an "if", which means that you can choose

10:57.240 --> 11:13.400
how you react to the event that you've been raced by someone else. Okay, to go into a bit of detail about

11:13.480 --> 11:21.160
transactions. What you usually know as a transaction is an ACID-compliant transaction

11:21.160 --> 11:31.000
from database systems. These transactions provide the serializable isolation level.

11:32.120 --> 11:40.440
So you can essentially take a list of transactions, replay them in the same order, and you get the same

11:40.440 --> 11:48.200
state each time. Which is actually used by database systems to do crash recovery.

11:49.400 --> 11:59.800
And this level of isolation is nice for a database, but a bit of an overkill here,

11:59.800 --> 12:07.960
because we would have to do the copy from shared to local, modify, and then commit. And we would

12:08.040 --> 12:14.600
also have to provide some temporary storage. And for a microkernel running on a microcontroller,

12:14.600 --> 12:20.360
that's not really viable. And I believe it's not really viable for any microkernel anywhere,

12:20.360 --> 12:27.080
because that's a really high level of overhead. But there are other isolation levels.

12:29.800 --> 12:35.560
And one of them is so-called read committed. In this isolation level,

12:36.440 --> 12:42.200
all readers only see data which was committed by someone else.

12:44.040 --> 12:52.840
And it turns out that this is just the isolation level you need to live in the kernel

12:52.840 --> 12:58.440
in this weird partially synchronized or partially consistent state.

12:59.160 --> 13:07.880
And therefore, if you run unicore, this framework offers this isolation level,

13:08.520 --> 13:18.440
which means that the data structures as a whole, if you take some hash table or, let's say, a

13:18.840 --> 13:31.160
scheduling table, may develop some unusual properties, like two threads running

13:31.160 --> 13:38.680
on the same core at the same time. But the entry for an individual thread is consistent.
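As an illustration of such an "unusual property", here is a sketch (hypothetical types, not CMRX code) of a cross-entry invariant that read committed isolation can transiently violate while each entry stays internally consistent:

```c
#include <stdbool.h>

enum thread_state { READY, RUNNING };

struct thread_entry {
    enum thread_state state;
    int core;                       /* core this thread is running on */
};

/* Cross-entry invariant: at most one RUNNING thread per core.  Each
 * entry on its own is consistent, but between two commits a reader
 * may observe the table in a state where this invariant fails. */
bool one_running_per_core(const struct thread_entry *t, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (t[i].state == RUNNING && t[j].state == RUNNING &&
                t[i].core == t[j].core)
                return false;
    return true;
}
```

Algorithms built on such a table have to tolerate the window where the invariant does not hold.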

13:38.680 --> 13:50.200
Then, as I mentioned, because read-only transactions are purely read-only actions as a

13:50.200 --> 13:59.560
whole, they don't block each other. And you only pay the overhead of transactions once you get

13:59.560 --> 14:11.000
contention. And, just for the sake of terminology:

14:13.400 --> 14:19.560
if you commit a read-write transaction, it will invalidate any other transaction which is

14:19.560 --> 14:27.720
in flight. So, it was opened but it wasn't committed yet. And when a transaction is invalidated,

14:27.800 --> 14:38.120
it will fail to commit. Read-only transactions don't invalidate anything, which is the point of

14:38.120 --> 14:47.640
not blocking each other. So, now, it isn't all roses, of course.

14:47.640 --> 15:03.480
While this is all based on lock-free, we are essentially trading locking for retrying. So when

15:03.480 --> 15:13.080
you get contention, the thing you probably want to do is retry. But there are cases where, as I mentioned,

15:13.160 --> 15:20.040
you are free to decide what to do. And there is one case. It is actually the first case where

15:20.040 --> 15:24.360
we implemented this, and it's the scheduler. Because it may happen that you have a normal

15:26.520 --> 15:32.200
scheduler run, so we are deciding the next thread. And then an interrupt comes and we change the

15:34.760 --> 15:42.200
state of a thread. And it fires another scheduler run, and they run concurrently, one on top of

15:42.200 --> 15:50.360
the other. And in this case, what happens is that the scheduler from the interrupt changes the state of the

15:50.360 --> 15:58.920
thread table, decides the new thread, sets it, and returns. Then the kernel scheduler finishes, and its transaction

15:58.920 --> 16:09.240
for the scheduler fails. So what does it do? It gives up, because someone else already did the scheduling for it.
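The give-up-on-conflict scheduler can be sketched with a toy transaction core; all names here (`transaction_open`, `transaction_commit`, `last_committed`) are illustrative, not CMRX's actual API:

```c
#include <stdbool.h>

/* Toy transaction core: one counter remembers the last committed ID. */
static unsigned last_committed;

struct transaction { unsigned seen; };

static void transaction_open(struct transaction *tr)
{
    tr->seen = last_committed;      /* read-only: just remember the ID */
}

static bool transaction_commit(struct transaction *tr)
{
    if (tr->seen != last_committed)
        return false;               /* someone committed under us */
    last_committed++;               /* claim the next ID */
    return true;
}

static int current_thread = -1;

/* Scheduler run: if the commit fails, a concurrent scheduler run
 * (e.g. from interrupt context) already picked the next thread,
 * so this run simply gives up. */
bool schedule(int next)
{
    struct transaction tr;
    transaction_open(&tr);
    /* ... scan the thread table for the next runnable thread ... */
    if (!transaction_commit(&tr))
        return false;               /* someone else scheduled for us */
    current_thread = next;
    return true;
}
```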

16:12.680 --> 16:21.880
That's a nice feature. As for the interesting properties, it means that your algorithms

16:21.880 --> 16:34.760
have to be resilient: you cannot assume that there will be only one max value in something which

16:34.760 --> 16:42.920
has unique values. Which is still fine if you are running on a uniprocessor, because there's

16:42.920 --> 16:50.200
this read-committed isolation. So the internals of the data entry itself will be consistent,

16:50.200 --> 16:57.160
because whoever raced you did the change atomically, because interrupts were disabled.

16:57.880 --> 17:07.320
But it gets very funny when you switch to multi-core, because there this assumption doesn't

17:07.320 --> 17:14.680
hold any more. While interrupts are disabled on one core, someone else may read on another core.

17:15.880 --> 17:22.360
But there is also another problem. It's actually on the next slide.

17:22.360 --> 17:35.800
Namely that, as it is implemented right now, it only works in unicore setups. Because currently,

17:37.240 --> 17:41.560
we commit the transaction at the beginning, when you call transaction commit.

17:43.000 --> 17:51.320
And if you were running multi-core, there would be an ambiguity if anyone opens a

17:51.320 --> 17:55.640
transaction while you are inside commit, which is technically possible.

17:58.040 --> 18:06.440
You couldn't say whether you have to abort this transaction or not as an effect of the transaction

18:06.440 --> 18:10.920
which is currently being committed. And there are two possible solutions to this.

18:11.480 --> 18:15.240
One: synchronize opening transactions, i.e. starting transactions, with commit.

18:15.880 --> 18:23.160
Well, it will remove the feature that your read-only transactions are read-only operations.

18:25.320 --> 18:29.480
But it's the safer way because you get this synchronization and it's clear.

18:30.120 --> 18:40.440
And another option is that we will mark the transaction as committed, or rather the internal synchronization

18:40.440 --> 18:46.120
point will be marked at the end of commit, at the point when you call transaction done,

18:46.920 --> 18:53.320
which also provides the synchronization, but you don't need to use mutexes.

18:54.360 --> 18:56.600
So this is the next update which will be done.
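The second option, moving the published commit point to the end, might look roughly like this sketch with C11 atomics; the names are hypothetical, and on multi-core the check inside `txn_commit` would itself need a compare-and-swap to close the window discussed above:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint committed_id;    /* ID of the last fully committed txn */

struct txn { unsigned seen; bool writer; };

void txn_open(struct txn *t)
{
    /* Read-only open: a single acquire load, no stores at all. */
    t->seen = atomic_load_explicit(&committed_id, memory_order_acquire);
    t->writer = false;
}

bool txn_commit(struct txn *t)
{
    /* A writer that finished after our open invalidates us. */
    if (atomic_load_explicit(&committed_id, memory_order_acquire) != t->seen)
        return false;
    t->writer = true;               /* exclusive phase: modify data now */
    return true;
}

void txn_done(struct txn *t)
{
    /* Publication happens here, at the end, as a release store. */
    if (t->writer)
        atomic_store_explicit(&committed_id, t->seen + 1,
                              memory_order_release);
}
```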

19:02.920 --> 19:08.520
So, why would you consider transactions? Maybe you think that this must be some

19:08.600 --> 19:13.960
crazy complicated framework. Actually, it's like 95 lines including comments.

19:15.640 --> 19:24.120
And the memory overhead is that you need one field to remember which was the last committed

19:24.120 --> 19:30.680
transaction ID. It's really trivial. And as long as your data structures and your algorithms

19:30.680 --> 19:39.400
can survive this wild, partially inconsistent internal state, you may benefit from it.

19:42.600 --> 19:47.640
I don't know what the general performance of this will be, for two reasons.

19:48.200 --> 19:57.000
First, our case is kind of specific. Our kernel is designed so that many actions which we

19:57.080 --> 20:05.080
transacted have quite a long phase where they are finding the entry to modify. Like we are

20:05.080 --> 20:13.320
searching: okay, you want to change the thread state, so you search for the thread, which takes some time.

20:13.320 --> 20:19.160
It's not a single operation. And then you change one flag, which is a single operation.

20:19.880 --> 20:31.240
Your mileage may vary if you use a different code organization. And then the second thing, which we don't know yet,

20:31.240 --> 20:40.040
is how funny it will become when we go multi-core. Because there are a lot of unknowns

20:42.200 --> 20:47.720
where I cannot predict what will happen or what synchronization problems will pop up.

20:49.240 --> 21:00.040
Okay, and we have some time, so let me list some news in the kernel. The

21:00.680 --> 21:08.120
kernel itself got support for the FPU, which is nice because we can get more performance.

21:08.120 --> 21:13.800
Mutexes were moved out completely into user space for the happy path.

21:13.880 --> 21:22.680
And the thing you are watching right now is actually running inside CMRX. So it's not just a presentation.

21:22.680 --> 21:28.360
It's an actual application, running on an actual operating system, running on top of Linux.

21:28.360 --> 21:35.720
It's an emulation; we are using Linux as a hardware platform. Another platform which was

21:35.720 --> 21:44.200
added is ARMv8, and the community is slowly growing, with more ports coming, namely a RISC-V port,

21:45.160 --> 21:53.720
which is contributed by another guy. So thank you. Are there any questions?

21:53.800 --> 21:57.800
Questions?

22:01.800 --> 22:08.600
If I understand correctly, you introduce an intentional race possibility in the transaction period, and I'm just wondering,

22:08.600 --> 22:14.120
so isn't that technically undefined behavior, and does that produce problems in practice, or not?

22:14.120 --> 22:21.400
It depends on whom you ask. If you ask someone from the academic sphere, they will tell you yes.

22:22.360 --> 22:32.280
Uh, if you ask someone from the field, they will tell you yes, but if you survive it,

22:32.840 --> 22:39.720
you will be told that you went through the field of undefined

22:40.600 --> 22:47.080
circumstances, so you should probably throw the result away. And it's up to you to decide what you

22:47.160 --> 22:55.320
actually do, because you may say: okay, whatever I computed is garbage. But the conditions

22:55.320 --> 23:00.840
which led to this garbage are something you actually control in the application, because you decide

23:00.840 --> 23:07.240
the transaction domain and, by using it, what the semantics of the data you cover by the

23:07.240 --> 23:13.880
transaction are. Like in the scheduler case, it may mean: okay, someone did it for me, so I just give up.

23:17.320 --> 23:23.080
Have you looked into read-copy-update? Because that's a scheme that should provide you with

23:23.080 --> 23:29.080
some more properties, like readers not being synchronized among each other, writers not being

23:29.080 --> 23:34.520
blocked by the readers, readers not being blocked by the writers. And I would say the main difference

23:34.520 --> 23:39.800
is that compared to transactions where you have to tolerate failed transactions, in the read copy

23:39.800 --> 23:46.120
update, you have to tolerate that the readers might see stale data sometimes. So would that serve you

23:47.160 --> 23:53.560
well or not, and if not, why? Okay, the question was whether I have checked read-copy-update.

23:54.600 --> 24:01.480
I think it's from Linux, or, okay, yeah, the Linux mechanism. The answer is, I haven't,

24:03.000 --> 24:14.600
because somehow when doing my research, it led me to some obscure mechanism for very specific

24:15.080 --> 24:23.240
situations, and then I switched to lock-free, so I skipped this. But that's a fair point:

24:23.240 --> 24:29.320
it's a different approach to the same problem, and you have different trade-offs.

24:29.320 --> 24:34.120
Yeah, so for now I haven't studied it, but you are actually the second person who mentioned it, so I will

24:34.120 --> 24:42.600
definitely study it. In the front row, yep. When you say, like,

24:42.680 --> 24:47.960
lock-free, you introduce what I would call transactional memory, or in this case,

24:47.960 --> 24:54.360
a form of transactional memory, and when you talk about lock-free, you talk about

24:56.360 --> 25:02.200
something that's not quite that, so, I'm sorry, it's not clear if you're talking about

25:02.200 --> 25:10.360
the proper meaning of lock-free, or something else. Yeah, so, like, it's not completely lock-free,

25:10.440 --> 25:16.600
it's inspired by lock-free, and you actually have good points. The question

25:16.600 --> 25:26.360
was whether it is lock-free when it actually locks, right? Kind of. And the truth is that,

25:27.400 --> 25:35.480
you know, when you go multicore, this locking will actually be quite useless, because another

25:35.560 --> 25:44.360
core can bypass you, unless you do inter-processor signaling and lock everything down. And when

25:44.360 --> 25:51.960
you are resilient enough to survive the hell before the commit, then you actually don't need to lock.

25:53.960 --> 25:58.600
I would just suggest that you take a look at software transactional memory.

25:58.600 --> 26:07.080
I think that we are out of time, so we will take questions outside. Thank you.

