WEBVTT

00:00.000 --> 00:13.440
Okay, so my name is Lucas Tavarek. I'm a technical leader at Intel and today's talk is about AI-based failure

00:13.440 --> 00:23.560
aggregation. So in our CI pipelines we often see hundreds of test failures every day and most

00:23.560 --> 00:31.720
of them are very similar but not exactly the same. Because of these minor differences, our

00:31.720 --> 00:39.800
engineers had to spend time manually analyzing these failures, repetitively, over and

00:39.800 --> 00:47.160
over again. Today I will show you how we reduced the overall noise in the CI system with

00:47.160 --> 00:55.400
text embeddings. Let's start with a quick, high-level overview of the original workflow that we

00:55.400 --> 01:03.720
used. On the left-hand side we had regular CI/CD pipelines that build components and

01:03.720 --> 01:12.200
run tests. Then we have a monitoring system that automatically gets the test results from the CI/CD

01:12.200 --> 01:20.280
pipeline, performs some initial analysis and then reports the test failures to an engineer.

01:21.160 --> 01:27.800
The engineer has to decide whether to update an existing bug report with a new

01:27.800 --> 01:35.720
instance of a given failure or treat it as a completely new bug and create a new bug report, and if

01:35.800 --> 01:43.800
needed, also start a regression isolation, where we try to find the faulty commit and

01:43.800 --> 01:50.760
revert it. Now let's talk about scale, which is quite important in this solution.

01:51.880 --> 01:59.160
So on the x-axis you have the number of tests, and on the y-axis the percentage of failures. From the

01:59.160 --> 02:07.000
scale perspective, if you have a low level of failures, then you are in a happy place; you

02:07.000 --> 02:13.400
often don't need some complicated system or process to handle these failures.

02:14.360 --> 02:20.120
If you have a relatively low number of tests and a high percentage of failures this is still

02:20.120 --> 02:27.080
manageable; in some semi-automatic way you can handle all of the failures. However, in our case

02:27.080 --> 02:39.800
we had hundreds of thousands of test cases executed each day and hundreds of test failures

02:40.760 --> 02:47.080
that needed to be analyzed every day by engineers. So this did not scale and we needed to find something

02:47.080 --> 02:56.680
better. So this is an overview of the desired workflow that we tried to implement. As you can see,

02:56.840 --> 03:03.960
most of the elements of the diagram are exactly the same, with one major change: the engineer,

03:03.960 --> 03:11.240
or the human in the loop, was removed from the critical path. So we tried to create a system that

03:11.240 --> 03:22.200
will automatically get the test results and decide by itself whether to update a bug or

03:22.200 --> 03:29.000
create a new one and start the regression isolation. The only interaction with the engineer was to

03:29.000 --> 03:38.760
send a report for verification. So, in short, our goal was to create a fully

03:38.760 --> 03:47.480
automated agent, and the main problem that we faced was how to determine whether a failure is a new issue

03:47.480 --> 03:55.080
or an already reported one. So here are a few potential solutions that we considered.

03:55.800 --> 04:02.280
The first and most basic one is to simply have no aggregation at all, so each test failure

04:02.280 --> 04:09.320
is a new bug report. But then we simply move the problem from the test reporting side to the

04:09.320 --> 04:15.400
bug management side: we have a lot of duplicates and also a risk of wasted resources, as

04:16.360 --> 04:25.000
we unnecessarily start the regression isolation process. Then we could go one step further

04:25.960 --> 04:33.000
and try to aggregate the failures per test case. So if we have exactly the same test case

04:33.000 --> 04:39.400
that is failing over multiple builds, then we can simply try to treat them as one bug report.
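This per-test-case aggregation option can be sketched as a simple grouping; the field names here are illustrative assumptions, not the actual schema:

```python
from collections import defaultdict

def aggregate_by_test_case(failures):
    """Group raw failures by test-case name so each group maps to one bug report."""
    groups = defaultdict(list)
    for failure in failures:
        groups[failure["test_case"]].append(failure)
    return dict(groups)

failures = [
    {"test_case": "test_matmul", "build": 101},
    {"test_case": "test_matmul", "build": 102},
    {"test_case": "test_conv", "build": 102},
]
grouped = aggregate_by_test_case(failures)
# test_matmul failures from builds 101 and 102 collapse into a single group
```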

04:40.120 --> 04:47.720
However, we still can have duplicates because the same bug can affect multiple test cases and we

04:47.720 --> 04:53.960
still have noise in the system. Another issue with this approach is so-called nested

04:53.960 --> 05:00.840
regressions. So we have exactly the same test case that on one build is failing due to some

05:00.840 --> 05:06.680
numerical errors and on another build started to fail due to a segmentation fault. With this

05:06.760 --> 05:12.920
naive approach we will not detect this, and we will still treat these two issues as one bug.

05:14.200 --> 05:22.360
Next, we could also try to do a direct log comparison: simply get the error messages from

05:22.360 --> 05:31.640
test failures and try to normalize them, removing timestamps, memory addresses or anything

05:31.640 --> 05:37.720
that changes from build to build but doesn't change the error signature. However, these

05:39.080 --> 05:46.280
text operations are highly complex and rule-based, so we would need to maintain them

05:46.280 --> 05:55.960
over and over again. Another option is to train a new machine learning model. However, this

05:56.040 --> 06:03.560
basic approach doesn't scale; it's really time-consuming and error-prone, mostly due to the

06:04.520 --> 06:11.560
need to prepare a dataset. We can't go with a few hundred examples; we need a few

06:11.560 --> 06:17.960
thousand, and we need to label this data, which is quite expensive. Of course, there's another way.

06:18.840 --> 06:25.560
Now let me introduce three main concepts that are essential to our solution.

06:28.040 --> 06:35.000
The first one is the text embedding. So what even is a text embedding? It's simply a numerical vector

06:35.400 --> 06:44.680
in a multidimensional space that represents the meaning of a sentence. So if we convert

06:44.760 --> 06:52.840
sentences into vectors, then vectors that are close to each other should represent a

06:52.840 --> 06:59.480
similar meaning. The words in a sentence can be different, but the meaning will be treated as

06:59.480 --> 07:08.840
the same. Then we have vector similarity search. If we have these vectors, then we need a way

07:09.560 --> 07:19.240
to decide how similar they are. To do this, we use the standard way, which is the cosine similarity

07:19.240 --> 07:25.720
metric, which is simply the cosine of the angle between two vectors. So if the cosine similarity is

07:25.720 --> 07:35.480
near one, then we treat such vectors as highly related, and if the cosine similarity is near zero,

07:35.480 --> 07:48.040
then we treat them as unrelated. Finally, the bi-encoder architecture. Here we have an example

07:48.040 --> 07:56.440
of the training stage of such an architecture. So we have input sentence A that we pass to the

07:56.440 --> 08:02.840
encoder model. In most cases this is some model like BERT or similar.

08:03.400 --> 08:09.960
It creates an embedding, and then we repeat this process with another sentence that we want to

08:10.760 --> 08:18.840
compare with: we go through the same encoder module, generate another embedding, and then, based on the

08:18.840 --> 08:28.040
cosine similarity, we decide whether these embeddings are similar or not. With the value of the cosine similarity

08:28.120 --> 08:39.560
we can then compare against the reference data in our validation dataset and update the

08:39.560 --> 08:46.920
weights of the encoder, or in other terms send positive feedback to the encoder if the embedding

08:46.920 --> 08:54.920
was good and we got the expected cosine similarity value, or send negative feedback

08:55.000 --> 09:05.640
if the embeddings are far off from the reference value. That's the theory part. Now let's go to

09:05.640 --> 09:15.400
the implementation. First, we use the sentence-transformers package that is available on

09:15.400 --> 09:23.720
PyPI; it is written in Python. As you can see, with one line of code we can load an already

09:23.720 --> 09:31.960
pre-trained model that is available online. Then we can create some example sentences that we will

09:31.960 --> 09:41.560
work on, and finally we can simply call model.encode with the input sentences, and in such

09:41.560 --> 09:49.000
case we will get three embeddings, because we had three input sentences, each embedding with

09:49.000 --> 09:56.600
384 dimensions, which is determined by the model that we loaded at the beginning. Then we have some

09:56.600 --> 10:02.760
utility functions from the sentence-transformers package; in this case we have a similarity function

10:02.760 --> 10:08.680
that we can call with the embeddings, and we will get the cosine similarity between these vectors

10:08.680 --> 10:25.400
as a tensor object. If you are not using the Python ecosystem, then I really recommend

10:25.400 --> 10:35.480
the Hugging Face text-embeddings-inference project, which has pre-built Docker images

10:35.800 --> 10:42.200
with CPU support and with GPU support, and it can also be easily deployed on Kubernetes

10:42.200 --> 10:52.920
clusters. In that case we can easily scale the solution, as the server can serve a given model

10:52.920 --> 11:01.800
and accept multiple requests at the same time. Here we have a web interface, so we simply

11:01.800 --> 11:11.000
send a web request to the server with an input sentence that we want to embed, and as an output

11:11.000 --> 11:22.280
we will get the embedding vector. Then the question is: which model should I use? Here

11:22.280 --> 11:28.280
comes the MTEB leaderboard that is available online. Here are some examples from the

11:28.280 --> 11:34.280
leaderboard and a few essential parameters that you need to take care of.

11:36.280 --> 11:44.200
The first one is memory usage. As you can see, the first model uses around 44 gigabytes

11:44.200 --> 11:52.040
of memory, so to run it you need a really powerful GPU or even multiple GPUs. On the other hand,

11:52.040 --> 11:59.240
the last example is the one from the sentence-transformers snippet, and it uses only around 100

11:59.240 --> 12:07.160
megabytes of memory and you can easily run it on CPU. Then we have embedding dimensions

12:08.360 --> 12:16.840
and more dimensions in theory lead to better accuracy. However, it comes at the cost

12:16.920 --> 12:26.760
of storage that is required to store these vectors. Finally we have max tokens parameter which represents

12:26.760 --> 12:36.840
the length of text that we can embed in a single go when we call the model. So if we have a

12:36.840 --> 12:45.160
large text input that the model can't handle, then it will in most cases truncate it

12:45.160 --> 12:54.120
to the number of tokens that it supports. Now we have text embeddings and a way to

12:54.120 --> 13:02.520
compare them; next we need a way to store them. In our case we use the pgvector extension for

13:02.520 --> 13:11.080
PostgreSQL. With this extension we have a new column type for vectors, so of course we can insert the

13:11.080 --> 13:19.320
vectors to the database alongside any other standard data that is available in postgres and then we

13:19.320 --> 13:29.880
have a few dedicated operators that allow us to query the database and retrieve vectors that are

13:30.120 --> 13:44.360
related to each other. Here we have an overall overview of the workflow that we prepared.

13:44.360 --> 13:52.120
We have a Python orchestrator that automatically gets the logs from Jenkins, which runs the tests.

13:53.080 --> 14:03.560
Then we extract the error signatures of the failures from the logs from

14:03.560 --> 14:10.360
Jenkins. We pass them to the Hugging Face text-embeddings-inference instance to get the embeddings.
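A request to such a TEI instance might look like the following sketch; the host, port, and a locally running server are assumptions, and /embed is TEI's embedding endpoint:

```python
import json

TEI_URL = "http://localhost:8080/embed"  # assumed local text-embeddings-inference instance

def embed_request_payload(texts):
    """Build the JSON body that TEI's /embed endpoint expects."""
    return json.dumps({"inputs": texts})

# Sending it requires a running server, e.g. with urllib:
#   import urllib.request
#   req = urllib.request.Request(
#       TEI_URL,
#       data=embed_request_payload(["Segmentation fault in test_matmul"]).encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   vectors = json.loads(urllib.request.urlopen(req).read())  # one vector per input
```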

14:12.040 --> 14:18.680
With the embeddings, we go to PostgreSQL to find out whether we already

14:18.680 --> 14:26.520
have seen a similar error in the database. If we have not, then we automatically create a

14:26.520 --> 14:36.280
bug report and also automatically start the git bisect and git revert workflow to

14:36.280 --> 14:45.240
revert the faulty commit from the system. Of course, if we have an embedding that already exists

14:45.320 --> 14:56.280
in the database, then we only update the bug report in the database.
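The decision step of this workflow can be sketched as follows. The threshold, table name, and helper are illustrative assumptions; the pgvector query appears only in a comment, with an in-memory stand-in used here:

```python
import math

SIMILARITY_THRESHOLD = 0.9  # tuning knob: above this, two failures count as the same bug

# With pgvector, the nearest stored failure would be fetched with something like:
#   SELECT bug_id, 1 - (embedding <=> %(vec)s) AS similarity
#   FROM failures ORDER BY embedding <=> %(vec)s LIMIT 1;
# (<=> is pgvector's cosine-distance operator; "failures" is a hypothetical table.)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def handle_failure(embedding, known_failures):
    """Return (action, bug_id): update an existing bug or open a new one."""
    best_bug, best_sim = None, -1.0
    for bug_id, known_embedding in known_failures:
        sim = cosine_similarity(embedding, known_embedding)
        if sim > best_sim:
            best_bug, best_sim = bug_id, sim
    if best_bug is not None and best_sim >= SIMILARITY_THRESHOLD:
        return ("update_bug", best_bug)       # seen before: just record the occurrence
    return ("create_bug_and_bisect", None)    # new issue: file a bug, start git bisect

known = [("BUG-1", [1.0, 0.0, 0.0])]
print(handle_failure([0.99, 0.1, 0.0], known))  # close to BUG-1, so update it
print(handle_failure([0.0, 1.0, 0.0], known))   # orthogonal, so open a new bug
```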

14:56.280 --> 15:05.560
There is always room for improvement, so here are a few recommendations from our side. The first is to analyze

15:05.560 --> 15:14.040
more logs. There are two cases; the first one is simply when you have really long error

15:14.120 --> 15:20.040
messages that you want to embed. As I said previously, the default option is to

15:20.040 --> 15:27.320
truncate the log, and by default most of the services will truncate it keeping the beginning and dropping the end.

15:28.120 --> 15:34.440
From our experience, in most cases the error messages and the real error signature

15:35.160 --> 15:41.880
are available at the end of the log, so the basic option is to simply keep the

15:41.960 --> 15:51.000
bottom of the log file, and then you will get better output. Another option, if you have the resources,

15:51.000 --> 15:57.960
is to simply use a bigger model with a larger max tokens parameter that can handle your input.
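The keep-the-tail idea can be sketched in a few lines; the character budget is an arbitrary assumption, and a real setup would budget in tokens rather than characters:

```python
def tail_for_embedding(log_text: str, max_chars: int = 4000) -> str:
    """Keep the end of the log, where the real error signature usually lives,
    instead of letting the embedding service truncate the end away."""
    return log_text[-max_chars:]

# A long log whose only interesting line is at the very bottom.
log = "\n".join(f"step {i}: ok" for i in range(10_000)) + "\nFATAL: segmentation fault"
snippet = tail_for_embedding(log)
# The fatal line at the bottom survives; early boilerplate is dropped.
```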

15:59.640 --> 16:08.440
Then there is the case of multiple log files: for example, you have error messages from your

16:08.440 --> 16:15.320
testing framework, and then you have some log files from, let's say, a web server that is being tested,

16:15.880 --> 16:21.720
and you would like to aggregate the failures based on these two input streams.

16:22.440 --> 16:30.040
The basic option is to simply get these two logs, merge them into one file and embed

16:30.040 --> 16:41.720
everything as one vector. Another option, if the logs are too big, is to try to use an

16:41.720 --> 16:50.600
LLM or any other method that will try to select the more important log. For example, if you have

16:50.600 --> 16:58.280
a segmentation fault, then maybe we don't need to analyze the basic log files and can

16:58.280 --> 17:05.880
go straight to the system core files or similar. Another option is to embed only the

17:05.880 --> 17:13.960
error signatures. If you have a few thousand lines in the log file and the error

17:13.960 --> 17:20.680
signatures only take, say, 20 lines, then, again with the

17:20.680 --> 17:27.720
help of an LLM or a similar solution, you could try to extract the error signature

17:28.280 --> 17:35.400
and then embed only this signature instead of the whole log file. Then we have fine-tuning.

17:36.520 --> 17:44.680
By default, the available pre-trained models are trained on a lot of big datasets

17:44.680 --> 17:53.080
that are generic and general in nature. We found out that there are some

17:53.800 --> 18:00.360
issues with domain-specific patterns. For example, we had a test that compared two vectors

18:00.360 --> 18:09.480
and reported the percentage of incorrect values. If, in the same test case, we had a run with

18:09.480 --> 18:17.640
2% of difference, and then we had another run on another build that resulted in 80% of

18:18.600 --> 18:28.040
difference, by default it was treated as the same error message and was aggregated, because

18:28.040 --> 18:35.160
the model didn't know that such issues should be treated as separate ones. In that

18:35.160 --> 18:42.120
case we can create a small dataset that will adjust the behavior of the encoder module

18:42.760 --> 18:52.680
and force it to separate these domain-specific patterns. Finally, throughout the talk I

18:54.280 --> 19:02.520
presented the failure aggregation part, which focuses on linking failures that

19:02.520 --> 19:09.080
look the same, that is, we have the same error message. There is also the option of failure correlation,

19:09.080 --> 19:16.280
and by that I mean linking failures that look different but have the same root cause.

19:16.280 --> 19:22.840
So if we have some unit or integration tests that we know from statistical data

19:22.840 --> 19:29.800
will in most cases fail together with some end-to-end test, then we can automatically try to

19:29.800 --> 19:37.800
correlate these failures as well and present them in one view for further processing.

19:39.560 --> 19:48.280
So, to sum up: with text embeddings for failure aggregation you can improve the efficiency of your

19:48.280 --> 19:54.920
process by minimizing noise, and I really encourage you to give it a try, as the entry barrier

19:54.920 --> 20:02.200
is really low: all of the packages are open source, there are pre-trained models available online,

20:02.840 --> 20:10.680
and you can start running them even on a CPU. So the key takeaway is that text embeddings

20:10.680 --> 20:15.000
are a low-effort way to turn CI noise into signal. Thank you.

20:22.280 --> 20:24.200
Now I'm open for questions.

20:32.600 --> 20:52.200
We didn't benchmark that, but the main issue is the scale. We have hundreds of failures,

20:52.200 --> 20:58.520
or even thousands of failures, so we didn't want to go with that option,

20:59.160 --> 21:03.720
given the infrastructure that we already have end to end.

21:14.520 --> 21:19.000
Yeah, so the question is about false positives and negatives. We measured the

21:19.640 --> 21:28.680
recall, as I remember. We wanted to have no false positives; we preferred to have multiple

21:28.680 --> 21:37.000
bug reports instead of one that correlated or aggregated the wrong issues together, and on

21:37.000 --> 21:46.600
our testing dataset we had around 90%, so it was really good from the get-go; we didn't

21:46.600 --> 21:48.600
have to optimize it too much.

21:48.600 --> 22:18.520
So the question is about the stability of the

22:18.520 --> 22:28.520
tests and how they are aggregated. In our case we are not aggregating over a single

22:28.520 --> 22:35.160
build run; we are aggregating across multiple builds. So if you have sporadic failures

22:35.160 --> 22:40.680
that occur once a week or so, this solution will automatically

22:40.680 --> 22:48.680
detect them, because we already have them in the database.

22:58.680 --> 23:09.480
Yeah, so the question is about the git bisect and whether we try to automatically detect

23:09.480 --> 23:17.480
whether we should update the test or fix the solution. For now, we haven't done it;

23:17.480 --> 23:26.440
there was some work where we get the commits that could have introduced the regression from multiple

23:26.440 --> 23:33.960
components and then try to use an LLM, asking it which of these commits is most

23:33.960 --> 23:40.920
likely related to the failure that we see in the system. So there was some work, but it was not

23:40.920 --> 23:45.960
part of this solution.

