WEBVTT

00:00.000 --> 00:23.160
Okay, let's start right into it. The next talk is about combining traces, namely the

00:23.160 --> 00:30.600
kernel ftrace trace and the LTTng user space trace. So, who am I? That's me, I'm Felix,

00:30.600 --> 00:35.840
I work as an embedded Linux developer at Siemens Foundation Technology in Munich. And if you

00:35.840 --> 00:40.880
ever want to contact me or reach out to me, this is where you find me: I'm on

00:40.880 --> 00:49.680
Mastodon and also reachable via email. So, before we start with the actual technical stuff, a

00:49.920 --> 00:58.560
quick poll: who of you has worked with ftrace? Ah, that's very good. Who of you has worked with

00:58.560 --> 01:07.400
LTTng tracing? Okay, a few people. User space or kernel? Kernel, user space?

01:07.400 --> 01:14.120
Both, okay, okay. And who of you used both ftrace and LTTng tracing at the same

01:14.120 --> 01:23.640
time? No one. Maybe that changes from now on. Let's see. So, where do we come from? This is

01:23.640 --> 01:28.480
not an artificial problem we want to solve here, but one we actually had in the field. So,

01:28.480 --> 01:34.400
we needed tooling to analyze a complex real-time communication box. So, in the field, we had a device

01:34.400 --> 01:40.840
that did some real-time communication. And the device was also secure-booted; that means

01:41.000 --> 01:48.000
you cannot easily make modifications to the kernel or load kernel modules. In general, I really

01:48.000 --> 01:54.640
dislike out-of-tree modules. So, the LTTng kernel module from this perspective is a bit problematic.

01:54.640 --> 01:59.960
And apart from that, it was also really hard to exchange or replace the kernel we were running

01:59.960 --> 02:06.760
there, simply because it was tedious. So, it was easier to use the infrastructure we already

02:06.760 --> 02:13.480
had in the kernel, and combine that with the things we could exchange. Like, we were able

02:13.480 --> 02:17.800
to deploy containers there. Also, privileged containers, which is good if you want to

02:17.800 --> 02:25.800
configure the tracing infrastructure. On top of that, we needed a solution which worked for our

02:25.800 --> 02:33.040
developers, who are familiar with the Trace Compass tool to analyze traces. And in some

02:33.040 --> 02:40.040
of our applications, we already had user space trace points. So, LTTng-UST trace points.

02:40.040 --> 02:43.840
And the developers really liked them because they are low overhead if you turn them off;

02:43.840 --> 02:47.640
they are also low overhead if you turn them on, but of course you have some. And they

02:47.640 --> 02:54.640
are quite easy to integrate into C or C++ applications. So, we made a plan with the goal

02:54.640 --> 03:02.160
to instrument the applications that did not yet have trace points with LTTng-UST

03:02.160 --> 03:09.040
trace points, and to use the ftrace infrastructure in the kernel. And in the end,

03:09.040 --> 03:16.220
the developers can use the Trace Compass tool, from the Eclipse project, to view the combined

03:16.220 --> 03:21.360
trace. And with combined, I mean they should not view two traces, but they really want

03:21.360 --> 03:28.720
to view a perfectly time-synchronized trace, both the LTTng-UST trace and the kernel trace. For the

03:28.800 --> 03:34.360
record: there is an ftrace reader plug-in for Trace Compass. We tried it, but it was

03:34.360 --> 03:41.240
horribly slow. And in the end, we also wanted a more streamlined solution. So, from that

03:41.240 --> 03:45.880
we derived some requirements, like: we want to convert the kernel ftrace trace into the Common

03:45.880 --> 03:50.440
Trace Format, because that is what the Trace Compass application can read, and also some other

03:50.440 --> 03:57.280
tracing tools. And another really important thing is that the tracing and the analysis

03:57.280 --> 04:01.240
of the trace should happen on two different systems, because the system we have in the

04:01.240 --> 04:06.520
field is an embedded device. It doesn't have a UI or whatever. So, we need a way to capture

04:06.520 --> 04:12.400
the traces there, copy them over to the machine where we analyze things, and then run

04:12.400 --> 04:19.200
the analysis there. And with that, before we started implementing, we had to look at some conceptual

04:19.200 --> 04:24.400
challenges. One was how to read the trace buffers in a portable way; this is what I just

04:24.400 --> 04:31.120
said, so that you can trace on one system and copy it over to the other one. Then how to convert

04:31.120 --> 04:36.160
thousands of trace events, like basically all the ftrace events, without writing a one-

04:36.160 --> 04:41.440
to-one conversion for each of the events, because that is really, really tedious. Then

04:41.440 --> 04:51.720
how to synchronize the trace clocks, both in user space and kernel. And finally, no solution

04:51.720 --> 04:55.680
really works out if the developers don't accept it. So, we need a solution which also

04:55.680 --> 05:03.040
works for non-experts like application developers, for instance. And with that, let's dive

05:03.040 --> 05:08.640
into it. The first challenge is how to read the trace buffers. I hope you can see it in

05:08.640 --> 05:15.800
the dark mode here as well. In the diagram, you see that we use trace-cmd on the device,

05:15.800 --> 05:21.040
which in the end is just a front end to the tracefs in the kernel, and which you can ship

05:21.040 --> 05:26.680
in a privileged container, that is, the trace-cmd binary. And with that, we can record both

05:26.680 --> 05:31.520
the flight record, so the trace itself, the binary trace, as well as all the metadata we need,

05:31.520 --> 05:40.240
like the endianness, the event descriptions, process command lines, and a couple more things. And

05:40.240 --> 05:46.640
then we copy this thing over to the other system where we do the analysis. And there, we

05:46.640 --> 05:52.400
have these great libraries from Steven: libtracecmd, libtraceevent and

05:52.400 --> 05:59.320
libtracefs, to open up the file and extract the data there. And in the trace.dat file, you

05:59.320 --> 06:03.800
have basically a set of metadata, which are the kernel command lines, the descriptions

06:03.800 --> 06:09.920
of the events, event classes, and then the flight records, which are usually one buffer

06:09.920 --> 06:16.120
per CPU, but you can also have more buffers per CPU if you use sub-buffers, for instance.

06:16.120 --> 06:21.600
And then we use the event converter, which then converts the events which

06:21.600 --> 06:26.080
we have in the flight record from one representation, like the kernel representation, into

06:26.080 --> 06:35.080
the intermediate representation. So, the next challenge here is how to create

06:35.080 --> 06:42.320
the Common Trace Format trace. And for that, there's the great Babeltrace tool and

06:42.320 --> 06:48.960
library, which in the end is a stream processing application, which is plug-in

06:48.960 --> 06:56.120
based. So you have a set of input plugins, which can read an arbitrary format. And these

06:56.120 --> 07:03.840
are the green ones here, where we can implement a plugin which can read the ftrace

07:03.840 --> 07:10.480
binary trace. And then you have a set of muxer and filter plugins, which combine

07:10.480 --> 07:16.160
the traces from multiple input plugins. And then you have an output plugin. And in this

07:16.160 --> 07:22.160
case, we can just use the plugins which are already there, like the CTF writer plugin,

07:22.160 --> 07:28.200
to then write our trace in this Common Trace Format. And I would really recommend, if

07:28.200 --> 07:33.640
you didn't do it before, to look into that project, because this is quite flexible. And

07:33.640 --> 07:39.720
once you get a grip on it, it's easy to implement basically any trace format A to trace

07:39.720 --> 07:48.680
format B conversion. So, how to technically convert the trace events? The first thing which

07:48.680 --> 07:55.680
is noticeable is that the ftrace events are not stable. They are not a fixed binary trace

07:55.680 --> 08:01.000
format; it's not part of the kernel ABI. That means you need to capture these descriptions.

08:01.000 --> 08:06.900
But in the kernel, if you look at the trace events here, like the ones under events/irq/,

08:06.900 --> 08:12.380
in the format file, for instance: each trace point has one of these format descriptions here.

08:12.380 --> 08:18.340
There you get the internal type definition of the trace event; like here, you see the types,

08:18.340 --> 08:26.020
you see the offset, the size. And by that, you can parse the data in the binary flight

08:26.020 --> 08:33.100
record. Also, the trace events have a name: you have a system, a name, and then the

08:33.100 --> 08:38.020
format file. And the good thing is you don't have to do this parsing by yourself, because

08:38.020 --> 08:45.300
for that, you can use libtraceevent, which can already parse this textual representation

08:45.300 --> 08:52.340
of the event, and then provides you APIs to extract the event type

08:52.340 --> 09:00.460
and all the field offsets and field types. On the other side, we have the Babeltrace library,

09:00.460 --> 09:08.220
which can programmatically create events, or, more technically correct, event classes. So, what

09:08.220 --> 09:15.980
we do there is just implement a mapping: by reading the event classes out of the

09:08.220 --> 09:15.980
trace.dat metadata, we recreate them in Babeltrace, in the

09:22.340 --> 09:29.580
intermediate representation. And then we store these mappings in a hash table, because when

09:29.660 --> 09:35.420
we read the flight record, we just get event IDs. And this is potentially hundreds of

09:35.420 --> 09:40.460
megabytes of data. This really needs to be efficient; otherwise, the whole implementation

09:40.460 --> 09:46.460
is horribly slow. And even though it runs offline, you still want to have some performance;

09:46.460 --> 09:54.900
otherwise, it just takes too long to do the conversion of the traces. The second part

09:54.980 --> 10:01.300
is semantics. So, as I said, there is this LTTng kernel module, which can directly write

10:01.300 --> 10:09.460
Common Trace Format traces. But it has some slight differences in the semantics of the

10:09.460 --> 10:17.300
events compared to ftrace. So, one thing is that in ftrace, you have the events

10:17.380 --> 10:25.140
named after the system and then the name, and in LTTng, you don't have that system name. So,

10:25.140 --> 10:29.540
for events like the IRQ events, where you have the system irq, and then an event which doesn't

10:29.540 --> 10:37.620
start with the irq prefix, you have to add the prefix, so that the trace tooling, the

10:37.620 --> 10:43.220
analysis tooling like Trace Compass, is able to understand that this now actually is an IRQ event,

10:43.220 --> 10:49.220
because it expects the LTTng semantics and not the ftrace semantics. So, this part of the conversion

10:49.220 --> 10:56.020
is easy. And then you also have field name conversions: like in the kernel, you have the process

10:56.020 --> 11:03.460
ID, or pid, and LTTng consistently uses the term thread ID, or tid, for the same thing.

11:05.060 --> 11:09.220
Then you have some field values you have to convert, like in the scheduling events, you have

11:09.380 --> 11:16.420
the priority values. In LTTng, they are adjusted by MAX_RT_PRIO. So, you have to subtract

11:16.420 --> 11:27.060
100 in this case. And then there are some events which only exist in ftrace, or they only

11:27.060 --> 11:35.460
exist in the LTTng kernel module trace. For the ftrace-only events, we can just add them, because

11:36.420 --> 11:39.540
the analysis tooling doesn't care if there are more events which it can't understand.

11:40.900 --> 11:46.420
For events we don't have in ftrace, we can't do anything, because we only have ftrace on

11:46.420 --> 11:51.860
the system. So, either we add these trace events to the kernel and eventually profit from that

11:51.860 --> 11:59.220
or we just accept that we don't have them. Another quite funny thing is the LTTng pid:

11:59.300 --> 12:08.660
the kernel pid is named tid in LTTng. And in LTTng, we also have a pid, but this

12:08.660 --> 12:16.820
refers to the kernel thread group ID, or tgid, which unfortunately is not captured in the

12:16.820 --> 12:23.300
sched_process_fork event. So, in the ftrace case, we don't have that information, but in LTTng,

12:23.940 --> 12:29.060
we want to have it for the analysis. It also works without that, but this is something I want

12:29.060 --> 12:34.740
to contribute to the kernel, to also capture that in the sched_process_fork event. I guess it's

12:34.740 --> 12:44.340
cheap to capture, and then we can really do a one-to-one mapping. The next challenge is how to

12:44.340 --> 12:52.740
synchronize the trace clocks. And this might look simpler than it actually is. So, the first thing,

12:52.900 --> 12:59.300
you have to be aware of is that the user space trace and the ftrace trace are captured individually.

12:59.300 --> 13:06.100
There's no synchronization between the two. So, what you do is you start the ftrace trace and then

13:06.100 --> 13:13.620
start the LTTng user space trace, or the other way around. And once you run into the issue, or

13:13.620 --> 13:18.660
into the problem you want to analyze, you have to stop both. For ftrace, you can just use a

13:18.660 --> 13:25.860
trigger, which then turns off the tracing. For LTTng, there's a similar thing you can use,

13:25.860 --> 13:33.940
but you also have to program it individually. And if you use two different trace clocks for

13:33.940 --> 13:40.740
both traces, there is really no way to reliably correlate them at a microsecond or below that

13:40.740 --> 13:48.580
granularity. So, what you need to do is use the same clock, and LTTng user space uses

13:48.580 --> 13:55.780
CLOCK_MONOTONIC. In the kernel, usually the local trace clock is used, but you can switch that, and

13:55.780 --> 14:01.220
by switching it to the monotonic clock, both traces use the same clock. So, they are running at the same

14:01.220 --> 14:08.580
pace, and by that they are already aligned. Then there's one more thing to consider, and that

14:08.580 --> 14:16.660
is: in LTTng user space, the trace events are also adjusted by the wall clock. But luckily,

14:16.740 --> 14:26.420
the offset to the wall clock is also recorded in the trace metadata. So, for a particular value

14:26.420 --> 14:35.220
of the monotonic clock, you know what the value of the wall clock was. So, what we do then is extract

14:35.220 --> 14:41.860
that and just add that offset to the ftrace clock as well when doing the conversion. And by that,

14:41.940 --> 14:47.860
we have perfectly aligned traces, which is quite good if you look at the larger scale,

14:47.860 --> 14:57.220
and which are even wall-clock adjusted. How to make it usable for non-experts? The first thing

14:57.220 --> 15:04.180
is, Babeltrace provides a discovery interface; that means it can discover all the plugins. And then

15:04.180 --> 15:09.620
when you run Babeltrace, like here, and you provide a file, Babeltrace

15:09.700 --> 15:15.780
will ask all the plugins: can you read

15:15.780 --> 15:24.340
that file? Or do you have some processing mechanism for this input string here? And then the

15:24.340 --> 15:31.700
plugin can say either yes or no with some confidence. And then the plugin with the highest confidence

15:31.700 --> 15:39.780
wins, and Babeltrace fires the plugin up and registers it as a source plugin. And by that,

15:39.780 --> 15:46.340
you can just use this short form here, babeltrace2 trace.dat, to already read the

15:46.340 --> 15:54.580
trace.dat file and the events from there, and then they just go to your console. The second thing is we don't

15:54.580 --> 15:59.460
want to have them on the console, but we want to convert them to the Common Trace Format. So,

15:59.540 --> 16:05.700
we wrote an application, ftrace2ctf, which in the end does exactly that: it takes a

16:05.700 --> 16:14.900
trace.dat file and then converts it to the Common Trace Format, and here you specify an

16:14.900 --> 16:22.420
output directory where the traces are then written to. And the easiest user interface is one with a lot

16:22.420 --> 16:30.420
of command line parameters. No, I'm just joking. Usually you can

16:30.420 --> 16:36.180
just call it like that. So, we have here this ftrace2ctf. Then you specify: I want to have

16:36.180 --> 16:41.940
LTTng semantics. So, I also do the semantic conversion where I convert the pid to the

16:41.940 --> 16:49.940
tid and so on, and the event names. Then I provide my user space trace; this is also a directory in

16:50.020 --> 16:56.660
this case, and the output directory. And what the tool internally does is: it first reads the first event

16:56.660 --> 17:03.460
from the user space trace. From there, we extract the clock definition to get the offset to the

17:03.460 --> 17:11.940
wall clock. And then we set up the second trace processing graph, where we feed in the

17:11.940 --> 17:21.060
trace.dat file and some metadata like the clock definitions. And we also register the CTF reader to read

17:21.060 --> 17:29.220
the user space trace through that as well. And then we just follow the standard Babeltrace logic:

17:29.220 --> 17:36.580
we use a muxer to interleave the traces by time, and then the CTF writer to write them

17:36.580 --> 17:43.620
out to a directory. And this is what you were all probably waiting for: here we have a screenshot

17:43.620 --> 17:51.940
of the Trace Compass application. It's a bit small here, but you have to trust me. There we use

17:51.940 --> 17:57.380
a debug application and added a couple of user space trace points, and we used

17:57.380 --> 18:04.100
ftrace to capture the scheduling trace events. And here you see that Trace Compass is

18:04.100 --> 18:12.420
able to correctly analyze these traces and also show them together. So, with that we are done,

18:12.420 --> 18:17.140
and I think we have time for questions. If you're interested in the project, have a look at

18:17.140 --> 18:23.780
GitHub, under the Siemens organization: ftrace2ctf. And while implementing that, we also did a couple of upstream

18:23.940 --> 18:31.620
contributions, mostly bug fixes, which I also list here.

18:39.620 --> 18:45.700
Thank you, cool talk. So, what I wanted to see in the picture before: do you have like a

18:46.660 --> 18:52.980
drill-down from the user space trace point, like start and end for a function, and sort of

18:52.980 --> 18:59.060
align them to see, like, call-stack-wise basically. So, to see that it goes to an IO function

18:59.060 --> 19:05.940
down to some kernel softirq. Yes, so in theory you could use the function graph tracing as well,

19:05.940 --> 19:11.700
and then you would see the call stacks but in this case we only use trace events so we don't have

19:11.700 --> 19:19.060
a call stack, but you can use what is, I think, called thread analysis or control flow here, and at

19:19.220 --> 19:25.140
the bottom you see it, where you see the control flow through user space and then kernel as well.

19:26.420 --> 19:27.140
Thank you.

