WEBVTT

00:00.000 --> 00:08.640
Hello everyone, thanks for having me here.

00:08.640 --> 00:14.120
Today we're going to talk about retrieval-augmented generation, or RAG: where it works,

00:14.120 --> 00:18.760
where it starts breaking and whether knowledge graphs can help us with that.

00:18.760 --> 00:23.520
A couple of words about myself, my name is Miki Tekimarski, I'm a technical

00:23.520 --> 00:30.560
co-founder of a startup where we do AI agents for autonomous negotiations in procurement.

00:30.560 --> 00:36.480
So I've had a chance to explore different LLM-based architectures, including RAG and

00:36.480 --> 00:39.080
its extensions.

00:39.080 --> 00:40.680
Let's imagine a common scenario.

00:40.680 --> 00:46.760
You have a large knowledge base of unstructured documents, for instance, company policies.

00:46.760 --> 00:52.320
You would like to ask questions and get answers based on these documents.

00:52.320 --> 00:57.880
As a solution, you configure a RAG pipeline on top of that.

00:57.880 --> 00:59.880
At first, everything looks great.

00:59.880 --> 01:10.800
You ask simple questions and get valid responses, but then questions become more complex.

01:10.800 --> 01:15.700
Questions that require connecting multiple documents, questions about contradictions, or

01:15.700 --> 01:19.480
about how something evolved over time.

01:19.480 --> 01:22.280
And suddenly, RAG cannot address that.

01:22.280 --> 01:26.000
This is what leads us to GraphRAG.

01:26.000 --> 01:32.280
But before talking about limitations, let's first align on what RAG means at all.

01:32.280 --> 01:38.280
So I'll call the default implementation vanilla RAG in order to distinguish it from its

01:38.280 --> 01:39.520
extensions.

01:39.520 --> 01:40.840
So the idea is simple.

01:40.840 --> 01:44.960
You have a large unstructured knowledge base.

01:44.960 --> 01:49.280
You want a large language model to reason on top of that, but you cannot pass the whole

01:49.280 --> 01:55.600
knowledge base in, simply because the LLM's context window is limited.

01:55.600 --> 02:04.360
So you decide to split that text into chunks, and then on each question, you retrieve the

02:04.360 --> 02:10.800
top-k most similar chunks and pass only them to the large language model to serve as

02:10.800 --> 02:16.240
a context to answer your question.
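
Below is a minimal sketch of this vanilla RAG loop, not the exact setup used in the talk: naive fixed-size chunking, OpenAI embeddings, cosine similarity for top-k retrieval, and a single completion call. The model names and chunk size are illustrative assumptions.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on sentences or sections.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> str:
    q = embed([question])[0]
    # Cosine similarity against every chunk, then keep the k best as context.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    reply = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return reply.choices[0].message.content
```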

02:16.240 --> 02:23.680
For many cases, this approach works, and that's why it became so popular.

02:23.680 --> 02:27.040
But the problems start when questions stop being local.

02:27.040 --> 02:32.240
For example, questions that require combining information from multiple documents.

02:32.240 --> 02:37.280
Questions that depend on relationships between entities, global queries, like what are

02:37.280 --> 02:38.920
the main themes in this data?

02:38.920 --> 02:44.680
Obviously, that requires a high-level view of the whole data set, not separate chunks.

02:44.680 --> 02:49.000
The same goes for trying to spot contradictions, for instance.

02:49.000 --> 02:56.200
You might ask which documents disagree or contradict each other, or questions related

02:56.200 --> 02:58.360
to temporal reasoning.

02:58.360 --> 03:06.840
Again, with chunks of data, you do not capture what time they are relevant at, or maybe invalid

03:06.840 --> 03:09.120
at.

03:09.120 --> 03:11.360
This is where knowledge graphs come in.

03:11.360 --> 03:17.360
GraphRAG is an umbrella term for all the solutions that try to augment vanilla

03:17.360 --> 03:19.840
RAG with knowledge graphs.

03:19.840 --> 03:25.120
So the idea is simple: maybe if we add structure, entities and relationships, maybe cluster

03:25.120 --> 03:28.520
these entities and generate summaries of those clusters,

03:28.520 --> 03:37.640
we can recover the reasoning that vanilla RAG lacks.

03:37.640 --> 03:42.040
Actually, GraphRAG follows a pipeline similar to vanilla RAG's, but with much more

03:42.040 --> 03:43.960
going on under the hood.

03:43.960 --> 03:49.120
We can roughly split it into three stages: building a knowledge graph from your unstructured

03:49.120 --> 03:54.960
data, typically text, retrieving relevant subgraphs to be used as a context for answering

03:54.960 --> 03:59.880
your question, and the final answer generation using that context.
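
To make the three stages concrete, here is a toy sketch assuming LLM-based triplet extraction and an in-memory networkx graph; real systems like Microsoft GraphRAG or LlamaIndex are far more involved at every step.

```python
import json
import networkx as nx
from openai import OpenAI

client = OpenAI()

def build_graph(chunks: list[str]) -> nx.MultiDiGraph:
    # Stage 1: ask the LLM for (subject, relation, object) triplets per chunk.
    g = nx.MultiDiGraph()
    for c in chunks:
        resp = client.chat.completions.create(
            model="gpt-5-nano",
            messages=[{"role": "user", "content":
                       "Extract (subject, relation, object) triplets as a JSON "
                       f"list of 3-element lists from:\n{c}"}],
        )
        # A sketch: real code must validate that the reply parses as JSON.
        for s, r, o in json.loads(resp.choices[0].message.content):
            g.add_edge(s, o, relation=r, source=c)
    return g

def retrieve_subgraph(g: nx.MultiDiGraph, entities: list[str]) -> str:
    # Stage 2: serialize the edges around the entities found in the question.
    facts = [f"{u} -[{d['relation']}]-> {v}"
             for e in entities if e in g
             for u, v, d in g.edges(e, data=True)]
    return "\n".join(facts)

# Stage 3 is ordinary generation: pass the serialized subgraph to the LLM as
# context, just like the vanilla RAG answer() sketch earlier.
```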

03:59.880 --> 04:03.680
From the outside, it still looks like: ask a question, get an answer.

04:03.720 --> 04:08.640
However, internally, it's much more complex and more expensive, obviously.

04:08.640 --> 04:11.680
By the way, there is an open catalogue of GraphRAG patterns.

04:11.680 --> 04:16.640
You may be interested in taking a look at it.

04:16.640 --> 04:22.600
Let's consider some of the main open-source GraphRAG solutions available today.

04:22.600 --> 04:23.880
It's not an exhaustive list.

04:23.880 --> 04:31.840
I have picked some actively maintained solutions, which focus on different query types: Microsoft

04:31.880 --> 04:37.520
GraphRAG and LlamaIndex both cover the whole RAG pipeline, from the building of the

04:37.520 --> 04:43.200
knowledge graph up to the generation of the answer, while Graphiti covers only knowledge

04:43.200 --> 04:50.400
graph building and context retrieval, leaving the responsibility of setting up LLM reasoning

04:50.400 --> 04:55.440
on top of that context and generating an answer to you.

04:55.440 --> 04:59.040
I picked two text corpora for testing.

04:59.040 --> 05:08.600
A Christmas Carol as a static knowledge base, and a toy data set with synthetic events to

05:08.600 --> 05:14.960
serve as an evolving knowledge base.

05:14.960 --> 05:17.640
Let's start with Microsoft GraphRAG.

05:17.640 --> 05:23.320
The main promise here is the ability to answer global queries, like what are the main topics

05:23.320 --> 05:25.360
in this data.

05:25.360 --> 05:31.120
It works by extracting arbitrary entities and relationships and clustering them into hierarchical

05:31.120 --> 05:35.080
communities, generating a summary of each community.

05:35.080 --> 05:41.360
It creates multiple abstraction layers, even summaries of summaries, which allows

05:41.360 --> 05:48.360
the system to get a high-level view of the whole data set or of separate clusters,

05:48.360 --> 05:51.200
and thus answer these global queries.
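
A condensed sketch of that community-summary idea, with networkx's built-in Louvain as a stand-in for the hierarchical Leiden clustering Microsoft GraphRAG actually uses; the flat, single-level loop here is a simplification.

```python
import networkx as nx
from openai import OpenAI

client = OpenAI()

def summarize_communities(g: nx.Graph) -> list[str]:
    summaries = []
    for community in nx.community.louvain_communities(g):
        facts = "\n".join(f"{u} -- {v}" for u, v in g.subgraph(community).edges)
        resp = client.chat.completions.create(
            model="gpt-5-nano",
            messages=[{"role": "user",
                       "content": f"Summarize this cluster of facts:\n{facts}"}],
        )
        summaries.append(resp.choices[0].message.content)
    # Global queries are then answered over these summaries (or summaries of
    # summaries in the hierarchical case), not over raw chunks.
    return summaries
```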

05:51.200 --> 05:55.840
However, it comes with some significant constraints.

05:55.840 --> 05:59.080
First of all, you cannot enforce an ontology here.

05:59.080 --> 06:04.680
In other words, the entity and relationship types you would like to use:

06:04.680 --> 06:06.920
they're arbitrary in GraphRAG.

06:06.920 --> 06:08.960
And incremental updates are not supported.

06:08.960 --> 06:14.640
So as new data arrives, you would need to reindex, and that's very expensive in GraphRAG.

06:14.640 --> 06:17.440
We'll see that later.

06:17.440 --> 06:22.880
Also, it's worth mentioning that tracing the exact source documents that contributed

06:22.880 --> 06:30.200
to the answer is difficult here, because the answer is generated out of the summaries

06:30.200 --> 06:36.240
that comprise a lot of different documents or even summaries of summaries, as I said.

06:36.240 --> 06:38.880
In addition to that, it lacks some operational features.

06:38.880 --> 06:48.280
For instance, you cannot configure the persistence layer; indexes are stored as Parquet files locally.

06:48.280 --> 06:56.440
So I indexed the whole of A Christmas Carol with both Microsoft GraphRAG and vanilla RAG.

06:56.440 --> 07:03.720
And even on that relatively small data set, GraphRAG consumed a lot of inference tokens

07:03.760 --> 07:11.040
for indexing, while vanilla RAG obviously requires no inference beyond embedding the chunks.

07:11.040 --> 07:18.880
Indexing also took significant time, and more importantly, so do queries.

07:18.880 --> 07:23.520
They consume many more tokens and take a lot of time as well.

07:23.520 --> 07:29.200
So the local query was "who participated in the family dinner", and the global query was "what's

07:29.200 --> 07:33.000
the central lesson of the story".

07:33.000 --> 07:40.000
And again, if new data arrives, you need to reindex, and that's, again, indexing costs:

07:40.000 --> 07:41.440
you spend the same tokens.

07:41.440 --> 07:48.280
They do use some caching inside for some steps, but community summaries have to be recomputed,

07:48.280 --> 07:56.120
and that consumes most of the tokens and most of the time.

07:56.120 --> 08:02.360
Also an important point: query types in GraphRAG are explicit, and you're supposed to choose

08:02.400 --> 08:04.240
them manually.

08:04.240 --> 08:09.840
Of course, you can try different query types for your question and compare the results.

08:09.840 --> 08:13.720
Local and global are the two original query types.

08:13.720 --> 08:16.160
DRIFT, they added that later.

08:16.160 --> 08:18.840
I haven't experimented much with that.

08:18.840 --> 08:22.320
It's an expanded version of the local query.

08:22.320 --> 08:30.400
And basic is just vanilla RAG, for your convenience, to compare the results.

08:30.400 --> 08:37.120
I mentioned provenance tracking, so tracing the specific documents that contribute

08:37.120 --> 08:39.640
to the answer.

08:39.640 --> 08:47.200
I just wanted to show how GraphRAG returns that: the CLI returns it embedded in the text

08:47.200 --> 08:48.800
answer.

08:48.800 --> 08:54.160
You can see these are indexes of the entities in the Parquet files.

08:54.160 --> 08:59.920
Obviously it requires parsing here, but their Python API returns it as structured data.
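
For illustration, a small parsing sketch for those inline markers; the "[Data: Entities (12, 34); Reports (2)]" shape matches what the CLI printed in my runs, but treat the exact format as an assumption and prefer the structured Python API where possible.

```python
import re

ANSWER = "Scrooge reforms after the three visits [Data: Entities (12, 34); Reports (2)]."

def parse_citations(answer: str) -> dict[str, list[int]]:
    # Collect record ids per record type from every [Data: ...] block.
    cited: dict[str, list[int]] = {}
    for block in re.findall(r"\[Data: ([^\]]+)\]", answer):
        for kind, ids in re.findall(r"(\w+) \(([\d, ]+)\)", block):
            cited.setdefault(kind, []).extend(int(i) for i in ids.split(","))
    return cited

print(parse_citations(ANSWER))  # {'Entities': [12, 34], 'Reports': [2]}
```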

08:59.920 --> 09:10.440
Let's move on to the next solution, which is LlamaIndex.

09:10.440 --> 09:13.520
It's a scaffolding library for RAG pipelines.

09:13.520 --> 09:21.960
It includes interfaces and different implementations for data ingestion, indexing, retrieval

09:21.960 --> 09:29.600
and persistence layers, and among other index implementations they have the

09:29.600 --> 09:35.520
property graph index, which is another GraphRAG implementation.

09:35.520 --> 09:40.800
It's highly customizable, supports different query approaches, and allows you to enforce specific

09:40.800 --> 09:43.920
entity and relationship types.

09:43.920 --> 09:50.120
Therefore it's not designed specifically for some particular query type, but it's supposed

09:50.120 --> 09:53.200
to be used as a configurable solution.

09:53.200 --> 09:55.160
And it's mainly a Python library.

09:55.200 --> 10:01.560
They have a TypeScript version as well, but it's limited; for instance, the property graph index

10:01.560 --> 10:06.320
is absent there.
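
A hedged sketch of what that configuration looks like: a PropertyGraphIndex with a schema-constrained extractor that enforces the entity and relation types. Class names follow LlamaIndex's property-graph documentation, but check your installed version, as the API moves quickly.

```python
from typing import Literal

from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

# The ontology you enforce; these particular types are illustrative.
entities = Literal["PERSON", "PLACE", "EVENT"]
relations = Literal["EMPLOYS", "VISITS", "ATTENDS"]

extractor = SchemaLLMPathExtractor(
    llm=OpenAI(model="gpt-5-nano"),
    possible_entities=entities,
    possible_relations=relations,
    strict=True,  # drop extracted triplets that fall outside the schema
)

index = PropertyGraphIndex.from_documents(
    [Document(text="Scrooge employs Bob Cratchit at the counting-house...")],
    kg_extractors=[extractor],
)

# Querying then looks like any other LlamaIndex index.
print(index.as_query_engine().query("Who attended the family dinner?"))
```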

10:06.320 --> 10:13.880
I did the same comparison with LlamaIndex; it consumed far fewer resources than GraphRAG,

10:13.880 --> 10:17.920
but still much more than vanilla RAG.

10:17.920 --> 10:26.400
But as I said, it's configurable, so you may come up with different results. The advantage

10:26.400 --> 10:36.640
here is more control, you configure what to extract, how to build relationships, etc.

10:36.640 --> 10:39.720
Another interesting solution is Graphiti.

10:39.720 --> 10:43.680
It focuses on dynamic and temporally aware knowledge graphs.

10:43.680 --> 10:46.160
What does temporally aware mean here?

10:46.160 --> 10:54.400
Graphiti deals with events, they call them episodes, each having an occurrence time.

10:54.400 --> 11:01.440
That allows it to use the occurrence time to answer temporal questions, and as new data

11:01.440 --> 11:09.600
comes in, instead of simply overwriting old facts, it invalidates them

11:09.600 --> 11:14.360
with a timestamp to preserve historical accuracy.

11:14.360 --> 11:20.920
That potentially allows you to answer questions such as what was true at some

11:20.920 --> 11:32.160
point of time in the past, or what changed between that point of time and now.
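
A hedged sketch of that flow, based on Graphiti's documented interface: ingest a timestamped episode, then ask a temporal question. The connection details are assumptions, and graphiti-core expects a running Neo4j instance behind it.

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

async def main() -> None:
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")
    await graphiti.build_indices_and_constraints()
    # Each episode carries its occurrence time; when a later episode
    # contradicts an earlier fact, Graphiti invalidates the old fact with a
    # timestamp instead of overwriting it, preserving history.
    await graphiti.add_episode(
        name="promotion",
        episode_body="Alice was promoted and now reports to Carol.",
        source=EpisodeType.text,
        source_description="HR event feed",
        reference_time=datetime(2024, 6, 1, tzinfo=timezone.utc),
    )
    results = await graphiti.search("Who is Alice's current manager?")
    for edge in results:
        print(edge.fact)

asyncio.run(main())
```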

11:32.160 --> 11:38.120
Testing Graphiti on a static data set like A Christmas Carol would be meaningless. To properly

11:38.120 --> 11:44.600
evaluate temporal reasoning, we would need a data set of events that evolve over time,

11:44.600 --> 11:50.600
where facts change and get invalidated. Unfortunately, I didn't have enough time to prepare

11:50.600 --> 11:54.200
a fully representative data set of that kind.

11:54.200 --> 12:01.800
Instead, just as a proof of concept, I created a small data set of events describing a fictional

12:01.800 --> 12:06.520
employee's career path.

12:06.520 --> 12:12.920
However, even with this simple data set, the default Graphiti setup did not produce correct results.

12:12.920 --> 12:16.760
For example, when I asked who the employee's current manager is,

12:16.760 --> 12:25.640
Graphiti returned a relationship it considered current, even though, according to the event history,

12:25.640 --> 12:31.240
the employee had already been promoted and reported to a different manager.

12:31.240 --> 12:33.800
However, that was the default setup.

12:33.800 --> 12:39.880
I did not use any ontology, did not enforce specific entity or relationship types.

12:39.880 --> 12:49.320
It used a small large language model, so I assume the quality issues stem from that,

12:49.320 --> 12:52.280
and therefore there is room for experimentation and improvement.

12:56.600 --> 12:59.480
The main risk here is also performance.

13:00.360 --> 13:03.880
Every new event triggers an LLM-driven knowledge graph update.

13:05.240 --> 13:11.000
Even on the small events from my toy dataset, the average ingestion of one event

13:11.000 --> 13:18.680
took 15 seconds, which is a lot. And given that the use case they describe as the main one is

13:18.680 --> 13:24.520
real-time events, it looks like it's feasible only for

13:24.840 --> 13:28.280
slowly changing data.

13:28.280 --> 13:35.720
But if you have high-volume event streams, then it's probably not feasible:

13:35.720 --> 13:42.040
you would need too many inference resources for that, and the latency is crazy.

13:43.800 --> 13:50.680
At query time, there is no LLM reasoning involved, so it's just about retrieving

13:50.680 --> 13:54.440
the relevant nodes or entities or subgraph.

13:54.440 --> 14:00.840
You can configure that; Graphiti allows some search configuration, including reranking methods.

14:00.840 --> 14:06.680
And what they claim is roughly what I observed on that toy dataset, similar results,

14:06.680 --> 14:10.040
but it may not be representative due to the size of the dataset.

14:12.920 --> 14:20.120
And at this point, the question arises: how do we actually compare these systems?

14:20.120 --> 14:23.480
So cost and latency are easy to compare, but what about

14:25.000 --> 14:30.680
answer quality with different query types on different datasets?

14:30.680 --> 14:34.680
And in terms of that, there is still a lack of adopted benchmarks;

14:34.680 --> 14:40.360
some benchmarks have recently been released, but haven't gained adoption yet,

14:40.360 --> 14:47.480
like GraphRAG-Bench, where they compare different methods with quantitative

14:47.560 --> 14:55.240
metrics on different query types, exactly the ones I listed, or Microsoft's BenchmarkQED,

14:55.240 --> 15:00.680
where they do pairwise comparison, focusing mostly on local versus global queries.

15:02.200 --> 15:09.240
But the point is that in practice, you would need to set up your own evaluations on your own data,

15:09.240 --> 15:14.040
and the queries you care about, to understand the actual performance.
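
As a starting point, such an evaluation can be as simple as running one question set through every pipeline and letting a judge model score the answers against references. Everything here is a sketch: the ask functions stand in for your own pipelines, and the single-digit scoring prompt is a deliberate simplification.

```python
from typing import Callable

from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, reference: str) -> int:
    # LLM-as-judge scoring; production setups should calibrate and average.
    resp = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nReference: {reference}\n"
                   f"Answer: {answer}\nScore 1-5. Reply with the digit only."}],
    )
    return int(resp.choices[0].message.content.strip())

def evaluate(pipelines: dict[str, Callable[[str], str]],
             dataset: list[dict]) -> dict[str, float]:
    # dataset items look like {"question": ..., "reference": ...}.
    scores = {name: 0.0 for name in pipelines}
    for item in dataset:
        for name, ask in pipelines.items():
            scores[name] += judge(item["question"], ask(item["question"]),
                                  item["reference"])
    return {name: s / len(dataset) for name, s in scores.items()}
```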

15:15.000 --> 15:22.360
There are also other open-source GraphRAG solutions we haven't considered today that apply

15:22.360 --> 15:26.840
different techniques. You may be interested in looking at those as well.

15:30.200 --> 15:35.800
So what can we conclude? Vanilla RAG is simple, cheap, and effective in some cases,

15:35.800 --> 15:42.760
until reasoning becomes structural or global. GraphRAG approaches genuinely extend what's possible,

15:42.760 --> 15:47.080
but introduce significant costs, complexity, and operational risk.

15:47.880 --> 15:52.280
There is no universally best solution, and without standardized benchmarks,

15:52.280 --> 15:58.760
adopting GraphRAG blindly is dangerous. And the only reliable path today is evaluation on

15:58.760 --> 16:01.720
your own data with your own questions.

16:04.840 --> 16:10.520
Thank you for your attention. If you are experimenting with RAG or its extensions,

16:10.680 --> 16:16.920
please send me a message. I'll be happy to answer your questions and participate in the discussion.

16:25.960 --> 16:27.960
Go ahead with the question.

16:28.920 --> 16:45.080
The question is: what RAG solutions do I apply at my company, in my project?

16:45.080 --> 16:52.120
So currently we do not use RAG in our solution, but as a startup, we iterated over

16:52.600 --> 17:01.000
different ideas, and one of our products was an anti-corruption copilot, which was intended to spot

17:01.000 --> 17:12.920
corruption risks in governmental bills. And there we applied LlamaIndex to design a graph with clauses

17:12.920 --> 17:20.680
and connections between these clauses. For example, you have references to another

17:20.680 --> 17:28.360
clause, and if you deal with chunks only, there is some reference like "in clause 5.1

17:28.360 --> 17:34.520
it's stated that something", and you do not know what is in that clause. And a knowledge graph

17:34.520 --> 17:38.760
allowed us to model those relationships between clauses and sections of the document, roughly as sketched below.
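
A toy sketch of that clause-graph idea, with an illustrative regex and two made-up clauses: model clauses as nodes and explicit cross-references as edges, so retrieval can follow the reference instead of stopping at the chunk boundary.

```python
import re
import networkx as nx

clauses = {
    "5.1": "The supplier must disclose its beneficial owners.",
    "7.2": "Violations of the rules in clause 5.1 void the contract.",
}

g = nx.DiGraph()
for cid, text in clauses.items():
    g.add_node(cid, text=text)
    # Turn textual references like "in clause 5.1" into explicit edges.
    for ref in re.findall(r"clause (\d+(?:\.\d+)*)", text):
        g.add_edge(cid, ref, relation="REFERENCES")

def context_for(cid: str) -> str:
    # Pull the clause plus everything it references, one hop out.
    ids = [cid, *g.successors(cid)]
    return "\n".join(f"{i}: {g.nodes[i]['text']}" for i in ids if i in g.nodes)

print(context_for("7.2"))  # includes clause 5.1, which a lone chunk would miss
```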

17:40.440 --> 17:46.600
But unfortunately, that project was decommissioned due to funding, and now we work

17:46.600 --> 17:49.960
on autonomous negotiations, where there is no RAG, at least for now.

17:59.080 --> 18:03.240
The question is: the embeddings used, which type and which dimensionality?

18:03.800 --> 18:11.400
Dimensionality, okay. It's not open source; it's OpenAI's text-embedding-3-small.

18:12.040 --> 18:18.760
And the dimensionality is the default one, if I'm not mistaken, 1536 dimensions. That one.
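
For reference, a quick way to confirm the default dimensionality; the optional dimensions parameter of the embeddings endpoint can shorten the vector if needed.

```python
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-small",
                                input="a chunk of text")
print(len(resp.data[0].embedding))  # 1536 by default
```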

18:19.400 --> 18:30.200
As for the comparison, as you saw, the numbers were rather rough, but I did not mean to make it

18:30.200 --> 18:35.560
representative to some millisecond accuracy; I just wanted to bring some sense of what to expect

18:35.560 --> 18:44.280
at all from these solutions. I tried to use similar large language models, GPT-5 Nano.

18:44.280 --> 18:52.120
It's not the most capable model, so the quality is another matter. But yeah, just to make it representative,

18:52.120 --> 18:56.920
of course, the same embedding models were used, and approximately the same inference speed.

19:05.560 --> 19:28.360
So the question is:

19:29.000 --> 19:37.320
how to build the graph effectively. The asker experimented and found that, for example, more, smaller chunks

19:37.320 --> 19:45.800
are better than bigger ones, and asks what solution I used: Microsoft GraphRAG, or did I

19:46.520 --> 19:55.560
apply something different, on a static knowledge base. Unfortunately, I haven't performed

19:55.640 --> 20:02.840
too many experiments on that; I just wanted to see the rough performance. But I haven't applied

20:02.840 --> 20:09.720
Microsoft GraphRAG in production; I just haven't had a use case for that. But what we

20:09.720 --> 20:17.160
lean towards, I don't know if it's correct, is rather using long context windows

20:17.400 --> 20:27.640
of large models to capture more, rather than chunking the text into small chunks, just

20:27.640 --> 20:37.400
in order to capture more context. When you start chunking, when you come up with smaller chunks,

20:38.200 --> 20:45.480
you kind of lose that context. So the bigger the chunks, the better the accuracy of the extraction.

20:45.560 --> 20:50.600
But again, it's more expensive, since you should use a larger model, which costs

20:50.600 --> 20:56.920
quite a bit more. The next question, as far as I caught it, was about the terms used, embeddings or similar.

20:59.400 --> 21:08.360
GraphRAG is, let's say, not configurable. So you pass in the unstructured knowledge base,

21:08.440 --> 21:18.760
unstructured documents, text specifically. And it does everything under the hood. So yeah,

21:18.760 --> 21:23.960
could you repeat the question? Sorry. Sorry, because I also implemented a custom

21:24.280 --> 21:35.160
graph without any framework. And to cluster the topics, I used semantic

21:35.160 --> 21:44.760
similarity. Maybe you used something similar, maybe not? Not really, but GraphRAG, they apply

21:44.840 --> 21:55.880
Leiden clustering. So maybe you can cluster not based on semantics, but rather

21:55.880 --> 22:05.320
on distance in the graph, I mean connectivity between nodes. GraphRAG, if I'm not mistaken,

22:05.320 --> 22:14.600
it uses the same approach, so maybe it would be more effective. Yeah, exactly. And that way

22:14.600 --> 22:24.920
you get some reasonable top-level clusters. Thank you very much. Thank you.
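
To close, a sketch of the contrast discussed in this last question: clustering entities by embedding similarity versus by graph connectivity. networkx's Louvain stands in for the Leiden algorithm that Microsoft GraphRAG uses, and the random vectors are placeholders for real node embeddings.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

g = nx.karate_club_graph()  # placeholder entity graph

# Option 1: semantic clustering over node embeddings (random stand-ins here).
embeddings = np.random.rand(g.number_of_nodes(), 8)
semantic_labels = KMeans(n_clusters=4, n_init=10).fit_predict(embeddings)

# Option 2: structural clustering by connectivity, ignoring text entirely.
structural = nx.community.louvain_communities(g, seed=42)

print("semantic clusters:", semantic_labels)
print("structural clusters:", [sorted(c) for c in structural])
```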

