WEBVTT

00:00.000 --> 00:13.360
ClickHouse, and the reason why we implemented, or rather re-implemented, it, and what the use cases were

00:13.360 --> 00:17.920
that motivated us to implement the inverted index in the first place.

00:17.920 --> 00:23.480
So before the inverted index: maybe some of you don't know ClickHouse, so a small introduction

00:23.480 --> 00:24.480
about it.

00:24.480 --> 00:30.600
So it is an open-source database, and I hope it will be your favorite database soon.

00:30.600 --> 00:36.160
And the development of ClickHouse started in 2009 at Yandex, and then it went to production

00:36.160 --> 00:38.080
in 2012.

00:38.080 --> 00:45.280
Then, since 2016, it has been open source. It is mostly used for column-oriented data storage,

00:45.280 --> 00:47.720
so best for aggregations.

00:47.720 --> 00:51.680
So it means that we have a separate file for each column.

00:51.680 --> 00:57.600
So yeah, there are also sorting keys, indexes, and pre-aggregation implementations.

00:57.600 --> 01:03.040
It's also distributed, so it's replicated, sharded, multi-master, and cross-region.

01:03.040 --> 01:06.240
I hope you are familiar with these terms.

01:06.240 --> 01:13.120
Then yeah, it's mostly an OLAP database, so it's mostly for analytical use cases, aggregations,

01:13.120 --> 01:15.120
and so on.

01:15.120 --> 01:21.640
Okay, so we can start with the full-text search, the journey that we started

01:21.640 --> 01:28.360
at ClickHouse. Actually, the inverted index in ClickHouse is not new: the main development,

01:28.360 --> 01:37.720
or experimental development, started in 2022, and then in late 2022 and early 2023

01:37.720 --> 01:42.920
it was released as an experimental feature.

01:42.920 --> 01:50.160
Then, around February last year, we started the re-implementation, and I will talk about why we

01:50.240 --> 01:59.200
started to re-implement it and the main reason. Then, in September 2025, we released a private preview

01:59.200 --> 02:05.520
for our cloud customers, but it is also in the open-source version, and then the beta version

02:05.520 --> 02:10.000
was released just last month, in December.

02:10.000 --> 02:15.520
So the use case, what's the problem that we are trying to solve?

02:15.520 --> 02:24.080
Let's say that we have some observability table that has very many rows; say we have

02:24.080 --> 02:30.880
two billion and some rows, and you want to search for a specific term in this

02:32.720 --> 02:40.720
highlighted row; you want to search for some specific term in this very large table.

02:40.800 --> 02:47.040
What's the query? We have a simple query: a SELECT from logs, and in the WHERE clause

02:47.040 --> 02:52.800
we have hasAllTokens, and you specify the tokens that we are looking for.

02:53.600 --> 03:00.240
So what's the problem with this query? The problem is that it is very slow, because it's a full

03:00.240 --> 03:07.920
table scan: you have to iterate over all rows in your table and look for the tokens in each one.
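
To make that cost concrete, here is a minimal Python sketch of what such a full-scan token search has to do. The table, tokenizer, and function names are illustrative, not ClickHouse code: every row is visited and tokenized, so the work grows linearly with the table.

```python
# Hypothetical sketch of the full-table-scan behaviour described above:
# without an index, a hasAllTokens-style filter must visit every row.
def tokenize(text):
    # crude whitespace tokenizer, a stand-in for ClickHouse's real ones
    return set(text.lower().split())

def full_scan(rows, tokens):
    """Return indices of rows whose text contains all search tokens."""
    wanted = {t.lower() for t in tokens}
    hits = []
    for i, row in enumerate(rows):      # touches all N rows -> O(N)
        if wanted <= tokenize(row):
            hits.append(i)
    return hits

logs = [
    "connection timeout on shard 3",
    "query finished in 12 ms",
    "connection refused on shard 1",
]
print(full_scan(logs, ["connection", "shard"]))   # -> [0, 2]
```

With two billion rows instead of three, this loop is exactly the four-second, whole-dataset scan shown later in the demo.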

03:08.240 --> 03:16.720
It quickly becomes very slow. So before the solution, I want to give some fundamental

03:18.800 --> 03:24.720
knowledge about some parts of ClickHouse. The granule: what's a granule?

03:24.720 --> 03:30.400
A granule is the smallest indivisible data unit processed by ClickHouse;

03:31.280 --> 03:45.200
by default, we put 8192 rows in a single granule. Then we also have a special

03:46.240 --> 03:52.080
index called a skip index: instead of skipping individual rows, it just

03:52.080 --> 04:00.240
skips granules. So afterwards you have to double-check whether your data really exists in this granule,

04:00.240 --> 04:08.480
in each row of the granule, or not. This can also be a bottleneck for some index

04:09.760 --> 04:17.680
lookups, but we need a very fast solution, one where we skip as many unnecessary rows as possible.
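
The granule-plus-skip-index mechanics can be sketched like this in Python. This is a simplified, assumed model, not ClickHouse internals: rows are grouped into granules, the skip index keeps one token set per granule, whole granules are skipped at query time, and surviving granules are re-checked row by row, which is exactly the double-check mentioned above.

```python
# Sketch: granules + a skip index. The index works at granule
# granularity, so matches must still be verified per row.
GRANULE_SIZE = 4  # ClickHouse's default is 8192; tiny here for the demo

def build_skip_index(rows):
    granules = [rows[i:i + GRANULE_SIZE]
                for i in range(0, len(rows), GRANULE_SIZE)]
    # one token set per granule
    return granules, [set(" ".join(g).split()) for g in granules]

def search(rows, token):
    granules, index = build_skip_index(rows)
    hits = []
    for gi, (granule, tokens) in enumerate(zip(granules, index)):
        if token not in tokens:
            continue                        # skip the whole granule
        for ri, row in enumerate(granule):  # double-check each row
            if token in row.split():
                hits.append(gi * GRANULE_SIZE + ri)
    return hits

rows = ["a b", "c d", "e f", "g h", "x y", "z x", "q r", "s t"]
print(search(rows, "x"))   # only granule 1 is scanned -> [4, 5]
```

The per-row double-check inside surviving granules is the bottleneck the talk refers to, which motivates filtering at row level instead.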

04:18.000 --> 04:26.720
So another reason, or motivation, was the tokenizers. Let's take this

04:26.720 --> 04:34.080
simple log line that has a timestamp; to keep it short, I didn't put a real error message,

04:34.080 --> 04:39.680
but the second part, in the curly brackets, is a query ID, and then we have a

04:40.480 --> 04:47.440
log level. And if you use a tokenizer like splitByNonAlpha,

04:47.440 --> 04:54.720
it will split like that: instead of taking the date column, or the date value, as a

04:54.720 --> 05:01.200
single token, it will split it into year, month, and day. Then there's also splitByString,

05:01.200 --> 05:09.760
where you can define your own split points. If you use that, you can

05:09.760 --> 05:17.760
keep the date, or the date and time, as separate whole tokens. In the first one,

05:17.760 --> 05:24.000
you can also see there's a dash in the query ID, and that's why

05:24.080 --> 05:31.840
it also splits on that. And you can also use n-grams; by default, N is three.
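
The splitting behaviours just described can be mimicked in a few lines of Python. This is a hedged sketch of the observable behaviour only; the real ClickHouse tokenizers (splitByNonAlpha, splitByString, ngrams, and so on) are implemented in C++ and will differ in detail.

```python
import re

def split_by_non_alpha(text):
    # split on every non-alphanumeric character,
    # so "2024-01-02" falls apart at the dashes
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def split_by_string(text, separators):
    # split only on user-chosen separators,
    # so a date can survive as a single token
    pattern = "|".join(re.escape(s) for s in separators)
    return [t for t in re.split(pattern, text) if t]

def ngrams(text, n=3):
    # sliding window of length n; the default N is three
    return [text[i:i + n] for i in range(len(text) - n + 1)]

line = "2024-01-02 {ab-12} ERROR"
print(split_by_non_alpha(line))      # date and query ID split at dashes
print(split_by_string(line, [" "]))  # date stays one token
print(ngrams("click"))               # ['cli', 'lic', 'ick']
```

Note how the same line yields completely different token sets, which is why the tokenizer choice has to be part of the index definition.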

05:31.840 --> 05:40.400
This is how the n-gram tokenizer processes the token, or the text. There's also an array tokenizer, which was

05:40.400 --> 05:47.280
demanded by some people: instead of tokenizing the text, it does basically nothing, it just

05:47.280 --> 05:56.160
gives the whole text as a single token. Okay, and with all these motivations, we needed a new

05:56.160 --> 06:04.720
solution, what we call the new full-text search, and I will explain its building blocks. So let's start

06:04.720 --> 06:12.240
with the syntax first, and how it is different from the other indexes. First, you create a table,

06:12.240 --> 06:20.160
you define your columns, and at the end you say: I want to have an index, with an index

06:20.160 --> 06:29.200
name and a column name. Then you specify the type, text. Now we have key-value pairs here,

06:29.200 --> 06:36.000
so you just define your tokenizer; you can use different tokenizers: splitByNonAlpha, splitByString,

06:36.000 --> 06:42.800
ngrams, sparseGrams, and array. And yeah, this is how you can define your text index,

06:42.800 --> 06:48.720
the new text index in ClickHouse, and currently we support these five tokenizers,

06:49.920 --> 06:57.280
which we didn't all support in the first version. Then, to use

06:58.640 --> 07:03.920
a text index, we introduced two new functions called hasAnyTokens and hasAllTokens,

07:04.640 --> 07:15.600
so basically they just search for the specified tokens in your column data.

07:15.600 --> 07:28.320
So the difference is that you can either search for any tokens, meaning that if even one

07:28.320 --> 07:35.680
of the terms appears in your row, it will return that row; or you can use hasAllTokens, which means that

07:35.680 --> 07:44.880
all the terms need to occur in your row data. Okay, so how do we implement the full-text search?

07:44.880 --> 07:53.920
It is based on the inverted index. So, what's the inverted index?

07:53.920 --> 08:09.760
It's basically a mapping from tokens to documents. In the picture, on the

08:09.760 --> 08:17.360
left side, you can see there's an ID and the content, and the tokens from this content:

08:17.360 --> 08:25.840
"the quick brown fox". The inverted index just maps tokens to documents; let's say the

08:25.840 --> 08:32.320
token "brown" appears in the first and the third document, and we just keep that record.
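
A minimal sketch of that token-to-documents mapping, plus the semantics of the two query functions as union and intersection of postings. The names are illustrative, not the ClickHouse API, and a plain Python dict of sets stands in for the on-disk structures.

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "lazy dogs sleep",
    3: "brown bears fish",
}

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)   # token -> set of document IDs
    return index

index = build_inverted_index(docs)
print(index["brown"])                  # {1, 3}: first and third document

def has_any_tokens(index, tokens):
    # any token present -> union of postings
    return set().union(*(index.get(t, set()) for t in tokens))

def has_all_tokens(index, tokens):
    # all tokens present -> intersection of postings
    return set.intersection(*(index.get(t, set()) for t in tokens))

print(has_any_tokens(index, ["brown", "lazy"]))   # {1, 2, 3}
print(has_all_tokens(index, ["brown", "fox"]))    # {1}
```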

08:33.760 --> 08:44.960
Then, as for how the index data is organized: we build, or maintain, a

08:44.960 --> 08:50.880
hash table in memory while processing, which is the picture on the right side.

08:51.440 --> 08:58.240
But once we need to store this hash map on disk, we split it into three different files,

08:58.800 --> 09:07.280
which sit next to the column data, say text.bin, the binary file. So we have a sparse index,

09:08.560 --> 09:14.080
we have a dictionary file, and we have a postings file, and I will explain what they are.

09:14.080 --> 09:20.720
Just a quick comment: the postings list is currently stored as a bitmap,

09:21.280 --> 09:28.320
because it's much smaller and very efficient to process in memory.
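
Why a bitmap is attractive for postings can be sketched with a plain Python integer as the bit container: set operations become single bitwise AND/OR instructions over packed bits. This is an assumed simplification; ClickHouse uses a real compressed bitmap implementation, not raw integers.

```python
def to_bitmap(row_ids):
    bm = 0
    for r in row_ids:
        bm |= 1 << r          # one bit per row ID
    return bm

def from_bitmap(bm):
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

postings_a = to_bitmap([0, 2, 5])   # rows containing token A
postings_b = to_bitmap([2, 3, 5])   # rows containing token B

# hasAllTokens-style intersection == bitwise AND
print(from_bitmap(postings_a & postings_b))   # [2, 5]
# hasAnyTokens-style union == bitwise OR
print(from_bitmap(postings_a | postings_b))   # [0, 2, 3, 5]
```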

09:28.320 --> 09:36.000
Okay, let's start with how we build the sparse index and the dictionary blocks. You can see that in this

09:36.880 --> 09:44.480
hash table, the tokens are not sorted, so it means that we first need to sort all these tokens,

09:44.480 --> 09:53.120
So the first process, or first step, is sorting all the tokens, and afterwards,

09:53.120 --> 09:59.680
once we have sorted the tokens, we split them into what we call dictionary blocks, and the dictionary

09:59.680 --> 10:08.640
blocks are separated by, I think by default, 128 tokens in a single dictionary

10:08.640 --> 10:17.520
block. We also use a compression called front coding: on the left side are the raw tokens,

10:17.520 --> 10:26.080
and on the right side you can see we store some numbers and characters, which is the front-coding

10:26.160 --> 10:34.880
compression. Each token also points to its postings list, and we keep that pointer.
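
Front coding on a sorted token list can be sketched as follows: each token is stored as the length of the prefix it shares with the previous token, plus the remaining suffix. This mirrors the numbers-plus-characters form on the slide, though the actual on-disk format in ClickHouse will differ in detail.

```python
def front_encode(sorted_tokens):
    encoded, prev = [], ""
    for tok in sorted_tokens:
        n = 0
        while n < min(len(prev), len(tok)) and prev[n] == tok[n]:
            n += 1
        encoded.append((n, tok[n:]))   # shared-prefix length + suffix
        prev = tok
    return encoded

def front_decode(encoded):
    tokens, prev = [], ""
    for n, suffix in encoded:
        tok = prev[:n] + suffix
        tokens.append(tok)
        prev = tok
    return tokens

tokens = ["click", "clickhouse", "cloud", "column"]
enc = front_encode(tokens)
print(enc)   # [(0, 'click'), (5, 'house'), (2, 'oud'), (1, 'olumn')]
assert front_decode(enc) == tokens   # lossless round trip
```

Because the tokens inside a block are sorted, neighbouring tokens share long prefixes, which is what makes this compression effective.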

10:34.880 --> 10:44.080
Afterwards we build what we call the sparse index. The sparse index

10:44.080 --> 10:52.800
is basically the first token of each block, and we use it for finding

10:53.520 --> 11:05.600
the right block. Okay, so I will show a short demo.

11:06.560 --> 11:24.560
Okay, hello. Okay, this button, I think... I will just run ClickHouse.

11:25.760 --> 11:32.240
So it's running there, on my laptop, no tricks: that's the ClickHouse client.

11:33.120 --> 11:39.440
Okay, I have a demo. Then, to use the new text index, we have a setting to enable the full-text

11:39.440 --> 11:48.880
index feature, so you have to enable that. Then I have three datasets; actually, I will use only the first

11:48.960 --> 12:11.200
one; there's a Hacker News dataset, and we can see how many rows it has. Okay,

12:11.200 --> 12:20.880
yeah, there are around 30 million entries in this one, and if I just want to search for

12:22.080 --> 12:30.080
any token in it, let's say I have a token, the text "ClickHouse", and I will set some settings;

12:30.080 --> 12:42.800
let me show the query conditions. So, in this first query,

12:42.800 --> 12:48.400
I will disable the skip index, and I will disable the query condition cache,

12:48.400 --> 12:55.440
to make sure that it doesn't use any cache for the condition. So it takes some time:

12:55.440 --> 13:04.400
for this simple query, it took around four seconds to scan around 30 GB of data,

13:05.120 --> 13:13.120
and it scanned the whole dataset to answer the query. The interesting thing is what happens

13:13.680 --> 13:21.280
if we enable the skip index, the text index that I have on this column: you can see that we can

13:21.280 --> 13:27.840
answer this query in seven milliseconds, instead of four seconds, and the interesting part is

13:27.840 --> 13:34.480
we don't need to scan the whole dataset; we just need to read 1.4 megabytes

13:34.480 --> 13:42.880
to answer this query, and also the peak memory usage is very low, okay,

13:43.520 --> 13:52.800
Now I can continue. So, okay, now the question is: what happened? How are we looking up

13:52.800 --> 14:01.360
this term in the text index? We first need to scan our sparse index,

14:01.360 --> 14:08.240
and since the sparse index contains the first entries from each block, and since they are

14:08.240 --> 14:12.880
sorted, we can simply use a binary search, we don't need to scan the whole sparse index,

14:12.880 --> 14:20.080
we simply use binary search to find the block index. Say we get block index one for this

14:20.080 --> 14:27.360
"ClickHouse" token. Having found block index one, we then get dictionary block

14:27.360 --> 14:35.760
one, and in this block we have different tokens, and now we check whether "ClickHouse" is really in this

14:35.760 --> 14:44.320
dictionary block, and if it is present, we need to read its postings list. And if it's

14:44.320 --> 14:53.360
present, we simply read the postings list, which contains some integers that, yeah,

14:53.360 --> 15:02.160
tell us in which documents this token occurs, the document IDs.
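
The lookup path just described can be sketched in Python: binary-search the sparse index (the first token of each dictionary block) to pick a block, then look for the token inside that one block. Block size and contents are illustrative; ClickHouse defaults to 128 tokens per block.

```python
import bisect

dictionary_blocks = [
    ["apple", "banana"],
    ["clickhouse", "cloud"],
    ["table", "zebra"],
]
# sparse index = first token of each block
sparse_index = [b[0] for b in dictionary_blocks]

def find_block(token):
    # rightmost block whose first token is <= token; tokens are
    # sorted, so this is where the token must live if it exists
    return bisect.bisect_right(sparse_index, token) - 1

def lookup(token):
    i = find_block(token)
    if i < 0:
        return None
    return token if token in dictionary_blocks[i] else None

print(find_block("clickhouse"))   # 1
print(lookup("clickhouse"))       # 'clickhouse' -> now read its postings
print(lookup("missing"))          # None: not in block 1, so no postings
```

Only one dictionary block is ever read per probed token, which is why the sparse index can stay tiny and fully in memory.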

15:02.160 --> 15:12.080
Afterwards, what we do is very interesting: instead of using these raw document IDs directly, we build

15:12.080 --> 15:20.160
what we call a virtual boolean column, so basically we create a very large boolean column with the size

15:20.160 --> 15:28.400
of the column itself, and by default it's all zeros, and when we find the token, we set

15:28.400 --> 15:36.880
ones in this virtual column at the row IDs that we found in our postings list. And

15:36.880 --> 15:45.120
afterwards there's a special optimization, a text-index optimization: instead of using the

15:45.120 --> 15:54.320
original filter column, we create this virtual column in the background, and we just basically remove

15:54.320 --> 16:01.920
the original filter column and use this virtual column instead. This way we really don't need to execute

16:01.920 --> 16:10.640
the original filter condition anymore, and this makes it very fast. This is what happened in the demo I just showed.
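
The virtual boolean column idea, sketched with plain Python lists standing in for real columns (an assumed simplification of what the engine does): a column-sized array of zeros, ones written at the row IDs from the postings list, then used as a cheap mask in place of the original text predicate.

```python
def build_virtual_column(total_rows, posting_row_ids):
    col = [0] * total_rows            # all zeros by default
    for r in posting_row_ids:
        col[r] = 1                    # mark rows the index says match
    return col

def filter_rows(rows, virtual_column):
    # filtering is now just applying a mask; the original text
    # predicate is never evaluated again
    return [row for row, flag in zip(rows, virtual_column) if flag]

rows = ["err a", "ok", "err b", "ok", "err c"]
postings = [0, 2, 4]                  # row IDs from the postings list
vcol = build_virtual_column(len(rows), postings)
print(vcol)                           # [1, 0, 1, 0, 1]
print(filter_rows(rows, vcol))        # ['err a', 'err b', 'err c']
```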

16:11.440 --> 16:20.800
Okay, so, just putting it all together: this is the whole

16:21.200 --> 16:28.240
mechanism. First we look for a term: we search in the sparse index, then we retrieve the

16:28.240 --> 16:34.560
dictionary block; then, if the term is present in the dictionary block, we retrieve the postings list, and we

16:34.560 --> 16:42.720
create our virtual boolean column, and we use this virtual boolean column just for filtering the data.
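
The steps just summarised can be put together in one compact sketch, combining the earlier pieces: build postings, split sorted tokens into dictionary blocks with a sparse index, then answer a query via binary search, the postings list, and a virtual boolean column. All names and sizes are illustrative, not ClickHouse's own.

```python
import bisect
from collections import defaultdict

BLOCK = 2  # tokens per dictionary block (ClickHouse defaults to 128)

def build_index(rows):
    postings = defaultdict(set)
    for rid, text in enumerate(rows):
        for tok in text.split():
            postings[tok].add(rid)
    tokens = sorted(postings)
    blocks = [tokens[i:i + BLOCK] for i in range(0, len(tokens), BLOCK)]
    sparse = [b[0] for b in blocks]
    return postings, blocks, sparse

def search(rows, token):
    postings, blocks, sparse = build_index(rows)
    i = bisect.bisect_right(sparse, token) - 1        # 1. sparse index
    if i < 0 or token not in blocks[i]:               # 2. dictionary block
        return []
    vcol = [0] * len(rows)
    for rid in postings[token]:                       # 3. postings list
        vcol[rid] = 1                                 # 4. virtual column
    return [r for r, flag in zip(rows, vcol) if flag]

rows = ["fatal disk error", "all good", "disk full"]
print(search(rows, "disk"))    # ['fatal disk error', 'disk full']
```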

16:43.280 --> 16:50.960
Okay, this is the summary, and if you say "I liked it", you can try it yourself: it's in the open-source

16:50.960 --> 17:04.960
ClickHouse version; actually, it is already enabled in 25.10, but the beta version

17:04.960 --> 17:13.200
was released in 25.12, and we also offer a private preview for our ClickHouse Cloud customers.

17:14.160 --> 17:19.520
Also, more features are coming: we are planning to implement phrase search, also

17:19.520 --> 17:28.000
known as positional queries, and also this year we are planning to add

17:28.080 --> 17:37.840
scoring to the text index. Okay, and that's the talk; thank you so much, and if you have

17:37.840 --> 17:50.880
a question... Okay, just one last slide: we have a dinner, a community dinner,

17:50.880 --> 17:56.400
so if you want to join, just scan the QR code, and this evening in Brussels we will have a

17:56.400 --> 18:12.640
community dinner. Thanks. Hi, I just wanted to ask you: when you had the tokens,

18:12.640 --> 18:28.640
how did you go about creating them? For instance, we have a lot of things,

18:28.880 --> 18:33.200
so as you are following the page, you will happen that you are at the version of coming out,

18:33.200 --> 18:42.000
just in case you can look at them? Yeah, and if you tokenize it, if you tokenize it, except for the

18:42.000 --> 18:49.440
question mark, then you will have it in your sparse index; but if you don't, you can... it's very,

18:49.440 --> 18:56.080
it's very flexible, so you can decide how you want to tokenize your data. So you define it:

18:56.080 --> 19:00.640
if there's an exclamation mark, we will either remove it, or you keep it, as you want.

19:00.640 --> 19:05.520
There's something in the text, if there's something in the text we need to find,

19:05.520 --> 19:13.120
can you use the root of the word? Yes, yes. So, for example, take "ClickHouse":

19:13.120 --> 19:19.040
if I search with just the root, "click", will it be at the top, the first one?

19:19.040 --> 19:25.200
You mean in the sparse index? Yes, the sparse index. We just basically use a binary search

19:25.200 --> 19:31.040
with an upper bound, and this means that after this point, this token cannot occur there,

19:31.040 --> 19:36.240
because it's sorted; it can only appear before that. But how do you decide which,

19:36.240 --> 19:42.800
like, which token is good enough when you are indexing; when you have different blocks,

19:42.800 --> 19:48.240
the first block was "click", and then you have to pick a token, say, how do you define

19:48.480 --> 19:55.760
what's in the data? Is there some analysis? No, we are just using a sort: we sort them,

19:55.760 --> 20:04.720
and we take the first one in each block. Okay, so it's really beneficial if you have a lot of

