WEBVTT

00:00.000 --> 00:14.400
Hi, so my name is Daniel Morin. I'm a senior software developer at Collabora, and I've been working on

00:14.400 --> 00:22.880
GStreamer for a while. So at Collabora, I lead the effort of adding analytics

00:22.880 --> 00:32.720
support to GStreamer. And I'm also a GStreamer developer. So, first, let's talk about

00:32.720 --> 00:39.920
GStreamer. As a lot of you probably already know, it's a major multimedia framework,

00:40.480 --> 00:47.840
open source. It has existed for 26 years now, I think, and it's maintained by a thriving and very

00:48.000 --> 00:53.680
welcoming community. So, if you have any interest, I invite any of you to join. I'm relatively

00:53.680 --> 01:03.920
new to the community, so I went through that myself. Yeah. It has an extensive plug-in system. It's a proven

01:03.920 --> 01:11.840
architecture, very flexible and, you know, composable, and it's been used for analytics for a long

01:11.840 --> 01:19.520
time. But more recently, with all the analytics interest, all the major silicon vendors adopted it,

01:20.560 --> 01:29.200
forked it, and, I have to say, specialized it for their frameworks. And so, why do we have GstAnalytics?

01:29.200 --> 01:34.720
Well, because we want the core of this to be available in upstream GStreamer, not necessarily

01:34.720 --> 01:39.920
in a vendor fork. So, this way, when you decide to move to a different platform, you don't have

01:39.920 --> 01:47.440
to restart pretty much everything you've done from scratch. And so, based on this, what we want to offer

01:47.440 --> 01:54.240
is loose coupling: we want you to have access to the acceleration, but in a loosely

01:54.240 --> 02:02.560
coupled way, like it is done for codecs and everything else in GStreamer. And yeah,

02:02.560 --> 02:10.400
in the way we do this, we really adopt the GStreamer way, the mechanisms, and in some cases,

02:10.400 --> 02:18.880
if a mechanism doesn't fit, then we adapt it to the analytics needs, and other elements in

02:18.880 --> 02:24.400
GStreamer can benefit from it. And most importantly, I think, it's a community-driven

02:24.720 --> 02:34.880
effort. We get input and reviews from very experienced developers, and we understand the needs

02:34.880 --> 02:41.680
of more people, because, like, they can bring them to the project. Just a little bit about

02:41.680 --> 02:47.040
what GstAnalytics is. Well, it's a set of elements that are a bit specialized for this.

02:47.040 --> 02:54.320
For example, ONNX inference, TFLite, Burn: those are the elements that encapsulate

02:54.320 --> 03:02.480
the inference frameworks. But in machine learning, the output is normally very

03:02.480 --> 03:09.440
specific to the model. So, we have elements, which we call tensor decoders, that translate

03:09.440 --> 03:16.880
the output from the models into a more standard form inside GStreamer that other elements

03:16.880 --> 03:22.800
can use, again decoupling it from the model. So, we can think about it: if you have object detection

03:22.800 --> 03:28.800
or segmentation, or another model dealing with voice, well, it's not necessarily specific to

03:28.800 --> 03:33.840
that model. You can build an application that's using it without being specific to that model.
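[Editor's sketch of the decoupling described here. The element names follow the upstream gst-plugins-bad ONNX/analytics plugins, but the model files, label file, and property names are illustrative assumptions, not taken from the talk:]

```shell
# Application pipeline built around model A (an SSD-style detector).
# The downstream elements only see standard analytics metas:
gst-launch-1.0 v4l2src ! videoconvert ! \
  onnxinference model-file=detector-a.onnx ! \
  ssdobjectdetector label-file=labels.txt ! \
  objectdetectionoverlay ! videoconvert ! autovideosink

# Swapping in model B only touches the inference step; the consumers
# (decoder, overlay) and the application stay unchanged:
gst-launch-1.0 v4l2src ! videoconvert ! \
  onnxinference model-file=detector-b.onnx ! \
  ssdobjectdetector label-file=labels.txt ! \
  objectdetectionoverlay ! videoconvert ! autovideosink
```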

03:33.840 --> 03:41.760
If you find later on that you have another one that is more efficient or better suited for your platform,

03:41.760 --> 03:47.040
you're able to change the model you're using without impacting

03:47.040 --> 03:54.720
your application. So, yeah, consumers. So, once we've produced those results,

03:55.840 --> 03:59.920
we've built a couple of elements that can use them, like we have overlays and a

04:00.480 --> 04:07.520
serializer and a deserializer. And yeah, the latest thing: a bit earlier I was talking about

04:07.600 --> 04:13.680
how we are adopting the built-in mechanisms. In GStreamer, for example, you know,

04:13.680 --> 04:23.120
GStreamer uses negotiation to find compatible elements, like, based on capabilities.

04:23.120 --> 04:28.400
So, for example, the inference elements will produce capabilities that are specific to the model,

04:28.400 --> 04:35.040
and then the tensor decoders have capabilities that describe what they can support.
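[Editor's toy illustration of the negotiation idea described here. This is not the real GstCaps API; capability sets are just dicts, and the element names are made up:]

```python
# Toy caps-style negotiation: the inference element advertises what its
# output tensors look like, each tensor decoder advertises what it
# accepts, and negotiation keeps the decoders whose constraints match.

def compatible(produced: dict, accepted: dict) -> bool:
    """A decoder is compatible if every field it constrains matches."""
    return all(produced.get(k) == v for k, v in accepted.items())

# Hypothetical caps for an SSD-style detector's output tensors.
inference_caps = {"media": "tensors", "model": "ssd", "dims": "1x91x4"}

decoders = {
    "ssd-decoder":      {"media": "tensors", "model": "ssd"},
    "yolo-decoder":     {"media": "tensors", "model": "yolo"},
    "keypoint-decoder": {"media": "tensors", "model": "blazepose"},
}

selected = [name for name, caps in decoders.items()
            if compatible(inference_caps, caps)]
print(selected)  # ['ssd-decoder']
```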

04:35.120 --> 04:40.400
So, this way, they can be negotiated, and this allows us to create a tensor decoder bin, so

04:40.400 --> 04:46.480
we can select the right tensor decoder based on the model that is used. So, it simplifies,

04:46.480 --> 04:57.040
again, building an application on top of it, based on that. So, in GStreamer, we have metas;

04:57.200 --> 05:03.680
metas are attached to a buffer, and we extended this for analytics. For example,

05:03.680 --> 05:09.680
you know, we can describe object detection, classification, key points, and so on. So,

05:10.960 --> 05:17.840
the meta framework is very flexible. At its basis, it's just a matrix

05:17.840 --> 05:23.760
that defines relations between different metas, and a container. So, you're able to

05:23.840 --> 05:32.400
create new metas as you need, but you already have multiple examples to follow
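[Editor's sketch of the idea behind the relation meta: a container of analytics results plus a matrix recording which results are related. The class and field names are illustrative, not the real GstAnalyticsRelationMeta C API:]

```python
# Toy relation meta: results ("mtds") live in a container attached to a
# buffer, and a relation matrix links them with no assumption about
# what kinds of results may be related to each other.

class RelationMeta:
    def __init__(self):
        self.mtds = []        # analytics results attached to a buffer
        self.related = []     # (from_index, to_index) relation pairs

    def add(self, kind, **fields):
        self.mtds.append({"kind": kind, **fields})
        return len(self.mtds) - 1       # index acts as the mtd id

    def relate(self, a, b):
        self.related.append((a, b))     # any kind may relate to any kind

    def relations_of(self, idx):
        return [self.mtds[b] for (a, b) in self.related if a == idx]

meta = RelationMeta()
box = meta.add("od", x=10, y=20, w=64, h=64, label="hand")
cls = meta.add("cls", label="open-palm", confidence=0.92)
kps = meta.add("keypoints", points=[(12, 25), (30, 40)])
meta.relate(box, cls)   # the box is classified...
meta.relate(box, kps)   # ...and carries key points

print([m["kind"] for m in meta.relations_of(box)])  # ['cls', 'keypoints']
```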

05:33.280 --> 05:40.560
to do this. And yeah, as of 1.28, which was released last week, we have,

05:40.560 --> 05:45.920
like, yeah, the negotiation between the tensor decoders and the

05:46.000 --> 05:54.160
inference elements available inside GStreamer. Here's a typical, very simple analytics pipeline.

05:54.160 --> 06:00.240
You normally have some sort of data preprocessing, then analysis, post-processing, and analytics.
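[Editor's sketch of that stage mapping as a gst-launch pipeline. Element names follow the upstream gst-plugins-bad plugins; the input, model, and label files are placeholders:]

```shell
# The four stages mapped onto one pipeline:
#   preprocessing   -> videoconvert ! videoscale
#   analysis        -> onnxinference (runs the model, outputs tensors)
#   post-processing -> ssdobjectdetector (tensors -> analytics metas)
#   analytics       -> objectdetectionoverlay (consumes the metas)

gst-launch-1.0 filesrc location=input.mp4 ! decodebin ! \
  videoconvert ! videoscale ! \
  onnxinference model-file=model.onnx ! \
  ssdobjectdetector label-file=labels.txt ! \
  objectdetectionoverlay ! videoconvert ! autovideosink
```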

06:00.240 --> 06:06.320
And here I just mapped it to a simple pipeline. You know, there are a lot of GStreamer

06:08.240 --> 06:15.840
elements that can already do most of the preprocessing. Some of it is still done inside the

06:15.840 --> 06:21.440
inference element, because we didn't have, for example, float support. We don't have a

06:22.800 --> 06:32.000
video format that supports floating point, but we are adding this; like, we are working on it,

06:32.000 --> 06:39.840
and it should be available soon. So, yeah, for the next part, I'll switch to a bit of a demo,

06:39.840 --> 06:46.880
so I'll show what I have, and explain a bit more what I was talking about here.

06:47.520 --> 06:56.560
So, yeah. So, the first demo is segmentation. So, yep, I have a segmentation model

06:56.560 --> 07:02.400
running inside a GStreamer pipeline. And, oh yeah, I forgot: if I have multiple objects,

07:02.400 --> 07:07.360
you know, they should be segmented out. So, that's just an example of segmentation.

07:08.240 --> 07:13.440
Here, there are a couple of things to note. Like, again, the type of object

07:13.440 --> 07:20.640
and the different Mtds, like the analytics meta with the segmentation, are put in relation

07:20.640 --> 07:30.640
together. So, it's very adaptable to your needs. So, for the next one, I'll have a couple

07:30.720 --> 07:37.360
of demos that are sort of built toward a final goal that we'll see a bit later. This should be

07:37.360 --> 07:50.880
detecting the... hopefully it will. It seems like it doesn't. Shoot. That's sad.

07:51.680 --> 08:07.360
Oh, anyway. I don't know, maybe the light is... oh, it seems like... oh, yeah, here.

08:07.920 --> 08:12.720
Maybe. Yeah. So, there we go: it detects my hand, and with an angle, like, it's a rotated box. So,

08:12.720 --> 08:19.680
it detects the orientation of my hand. And so, I'll go to the next one to explain a little

08:20.640 --> 08:27.920
bit how this is done. So actually, when it's detecting my hand, it's detecting specific

08:27.920 --> 08:34.080
positions on it to understand what the orientation is. And that's how the box gets rotated.

08:34.080 --> 08:41.440
And here, to be able to show you that, I'm using the relation metas, which are

08:41.440 --> 08:45.680
able to, you know, say that there's a bounding box that

08:45.680 --> 08:52.160
is associated with key points. And that's how the visualization is working.
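[Editor's toy version of the orientation step described here: derive a rotated box's angle from two detected key points, e.g. the wrist and the base of the middle finger. Purely illustrative; the real tensor decoder works on the model's output tensors:]

```python
import math

def orientation_deg(wrist, middle_base):
    """Angle of the hand axis, in degrees, measured from the +x axis."""
    dx = middle_base[0] - wrist[0]
    dy = middle_base[1] - wrist[1]
    return math.degrees(math.atan2(dy, dx))

# A hand pointing straight "up" in image coordinates (y grows downward):
print(orientation_deg((100, 200), (100, 120)))  # -90.0
```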

08:53.920 --> 09:00.640
So, if I go to the next one. So, actually, again, this is building toward something;

09:00.640 --> 09:08.560
right now we are just involving one model. We're building toward... oh, yeah. So, now you see my hand is,

09:08.560 --> 09:14.880
like, horizontal, but it appears vertical. That's to satisfy, like, the requirement of the

09:14.960 --> 09:20.960
next model. So yeah, it's shown upright in the window. So, this allows,

09:21.760 --> 09:27.440
at this point, for the first inference to be the pre-processing of the next model.

09:30.400 --> 09:37.840
And, here. So,

09:38.480 --> 09:50.240
you guys should have... okay. Oh, yeah. Well, maybe if I just position myself a bit like this.

09:54.960 --> 10:07.440
Sorry. Yeah. Yeah. Exactly. So, yeah. So, with the pre-processing, like, now I can detect

10:08.160 --> 10:16.320
key points on my hands. And, yeah. So, the first model was the pre-processing,

10:16.320 --> 10:27.600
and then it feeds into a second model that does the landmark detection. And, yeah. So, the next step

10:27.600 --> 10:36.000
is... actually, so I wanted to use... so, based on this, I can detect, like, the specific positions,

10:36.000 --> 10:43.360
the key points, of my hand, and do sign language recognition. So, with a bit of

10:43.440 --> 11:10.400
light, it's going to work. So... oh. Okay. So,

11:10.480 --> 11:28.720
pushing a key... oh, that's not the right thing. So, unfortunately, it doesn't seem

11:28.720 --> 11:34.960
to work here. But the idea is that I would feed the key points into a second model

11:35.040 --> 11:41.680
that would classify the positions of the key points and recognize the letter

11:41.680 --> 11:47.840
that I'm describing. And then I was able to, you know,

11:47.840 --> 11:54.240
detect, just based on timing, which letters, and form words, and describe... yeah.
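[Editor's sketch of that last stage: classify a normalized key-point vector into a letter by nearest centroid. The centroids here are made up for illustration; the talk's actual second model would be trained on sign-language data:]

```python
import math

CENTROIDS = {          # hypothetical per-letter key-point templates
    "A": [0.0, 0.1, 0.1, 0.2],
    "B": [0.9, 0.8, 0.9, 1.0],
}

def classify(keypoints):
    """Pick the letter whose template is closest to the key points."""
    def dist(letter):
        return math.dist(keypoints, CENTROIDS[letter])
    return min(CENTROIDS, key=dist)

print(classify([0.05, 0.1, 0.15, 0.2]))  # A
print(classify([1.0, 0.9, 0.8, 0.9]))    # B
```

Tracking which letter stays stable over time, as the talk describes, would then turn a stream of per-frame letters into words.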

11:57.840 --> 12:04.560
So, back to my presentation. Yeah. So, we can still improve this,

12:05.520 --> 12:13.040
and maybe also add ASL words, which is more than just... there's a... oh, yeah:

12:13.040 --> 12:18.960
that's video analytics, and here it was just image analytics, but we could

12:18.960 --> 12:24.320
recognize more complex language. And actually, one thing we were thinking is,

12:24.320 --> 12:29.920
there's an application called cdwet that is based on PipeWire,

12:29.920 --> 12:38.240
where the goal is to do a virtual camera, where it can

12:38.240 --> 12:44.000
remove the background from the camera, but independently of the application you're using.

12:44.000 --> 12:49.680
So, if the background removal of whatever video conferencing tool you're using

12:49.680 --> 12:54.960
is a bit crappy, you can use your own. And also, um, so, it could do background removal,

12:55.040 --> 13:00.960
but it could also do the hand sign recognition. So you could generate a transcript of what people

13:00.960 --> 13:08.480
are, like, kind of signing, basically. And, even going further, we could send the transcript

13:08.480 --> 13:14.320
to WhisperSpeech to, you know, do

13:14.320 --> 13:19.040
speech synthesis from the transcript. So, that's all; these are just

13:19.040 --> 13:23.780
examples of things that we are doing, but I'm mostly working on the infrastructure

13:23.780 --> 13:30.400
to get those models, and all the infrastructure, to get analytics working in GStreamer.

13:30.400 --> 13:40.000
But I think we've, yeah, we've come far, so I'm inviting everyone to look at this.

13:40.000 --> 13:46.840
So, in summary, GStreamer offers a very highly composable framework for building analytics pipelines.

13:46.840 --> 13:52.040
It has a powerful metadata framework based on a graph,

13:52.040 --> 13:56.040
so there is no assumption about what can be related,

13:56.040 --> 14:03.440
which helps you to describe the scene better, based on your intent,

14:03.440 --> 14:06.040
and it has loose coupling with the acceleration.

14:06.040 --> 14:13.440
So, it should facilitate porting to other hardware. Or, yeah.

14:14.440 --> 14:18.440
So that's all, thank you.

14:24.440 --> 14:27.440
Can you take questions?

14:27.440 --> 14:28.440
Yes?

14:28.440 --> 14:33.440
What is the GStreamer version you're using?

14:33.440 --> 14:36.440
It's 1.28.

14:36.440 --> 14:37.440
Yes.

14:37.440 --> 14:48.440
So, the question was: which version of GStreamer I was using for this. I was using 1.28.

14:48.440 --> 14:55.440
But it depends on exactly which element; some were there earlier.

14:55.440 --> 15:03.440
Some have been available since 1.24. For example, onnxinference has been there since 1.24, I think.

15:03.440 --> 15:13.440
Yeah, as of 1.28, all the ones that I've used are available, yeah.

15:19.440 --> 15:23.440
Yeah, so actually, I just run it on the CPU.

15:23.440 --> 15:27.440
So, the question was: which hardware am I using?

15:27.440 --> 15:31.440
So, I'm only using a CPU here.

15:31.440 --> 15:37.440
So, yeah, an AMD Ryzen 7, yeah.

15:45.440 --> 15:51.440
Okay, yeah, the question is: for ONNX, which package was I using?

15:51.440 --> 15:53.440
So, I'm using ONNX Runtime.

15:53.440 --> 15:58.440
I could have used... like, there's also ONNX Runtime GPU,

15:58.440 --> 16:04.440
but I didn't have time to set up all the hardware to get all the acceleration.

16:04.440 --> 16:11.440
So, all of this is on the CPU; it would be faster if I ran it on the GPU, of course.

16:11.440 --> 16:12.440
Yeah.

16:12.440 --> 16:20.440
Is it possible in GStreamer to put this on the GPU?

16:20.440 --> 16:21.440
Yes.

16:21.440 --> 16:22.440
Yes.

16:22.440 --> 16:27.440
So, so, the question is: is it possible to put this

16:27.440 --> 16:30.440
on a GPU? Yes.

16:30.440 --> 16:32.440
So, because we use ONNX Runtime,

16:32.440 --> 16:34.440
there's an abstraction there.

16:34.440 --> 16:40.440
There are different backends for ONNX Runtime.

16:40.440 --> 16:43.440
So, there's CUDA, TensorRT.

16:43.440 --> 16:47.440
There are also, well, a lot of others,

16:47.440 --> 16:52.440
VSI and, like, yeah; most of these

16:52.440 --> 16:55.440
have a backend for ONNX Runtime.

16:55.440 --> 17:06.440
Well, if you just build ONNX Runtime with those backends...

17:06.440 --> 17:11.440
okay, if the question is whether it's available with those plugins:

17:11.440 --> 17:15.440
it will be the same plugin as we use here; it's just that when you build ONNX Runtime,

17:15.440 --> 17:19.440
you'll have to build it with the specific backend.
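[Editor's sketch of what building ONNX Runtime with a GPU execution provider looks like. The flags follow the upstream ONNX Runtime build.sh; the paths are placeholders and should be treated as assumptions:]

```shell
# Build ONNX Runtime with the CUDA execution provider, so the same
# GStreamer onnx plugin can offload inference to the GPU:
./build.sh --config Release --build_shared_lib --parallel \
  --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda

# Or with TensorRT on top of CUDA:
./build.sh --config Release --build_shared_lib --parallel \
  --use_cuda --use_tensorrt --tensorrt_home /opt/tensorrt
```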

17:20.440 --> 17:21.440
So, yeah.

17:21.440 --> 17:24.440
For TFLite, just a bit related to this:

17:24.440 --> 17:27.440
for TFLite, it's a little bit different,

17:27.440 --> 17:31.440
but we have acceleration, like,

17:31.440 --> 17:36.440
a delegate available for the VSI NPU, which, yeah.

17:38.440 --> 17:39.440
Yes.

17:39.440 --> 17:42.440
[inaudible audience question]


18:04.440 --> 18:06.440
Okay.

18:06.440 --> 18:28.840
So, we try to... like, so the question is: are we utilizing the vendors' board support packages

18:28.840 --> 18:42.740
for the inference? So, we try to keep that to a minimum, and so we're

18:42.740 --> 18:48.640
not using the vendors' toolkits and packages directly, other than, like, what would

18:48.640 --> 18:55.040
be, you know, abstracted by the inference elements. So, we don't use the full framework, if

18:55.040 --> 19:21.040
that was your question. Yes. Exactly. Any other questions? Thank you.

19:25.040 --> 19:32.040
Thank you.

