WEBVTT

00:00.000 --> 00:08.000
All right folks, we are ready to get started again.

00:08.000 --> 00:12.000
Carol Chen is here to present "Get Your Docs in a Row with Docling".

00:12.000 --> 00:14.000
Let's give a hand to Carol.

00:14.000 --> 00:15.000
Thank you, Daniel.

00:15.000 --> 00:17.000
Hi everyone.

00:17.000 --> 00:22.000
Thanks for being here today.

00:22.000 --> 00:26.000
I think this is my 10th FOSDEM, perhaps.

00:26.000 --> 00:30.000
I've been coming almost every year, more or less, since 2013.

00:30.000 --> 00:37.000
But always learning something new from people and new development in projects.

00:37.000 --> 00:40.000
And new faces, old faces.

00:40.000 --> 00:44.000
A lot of people, I'm based in Finland, a lot of people who live in Finland.

00:44.000 --> 00:47.000
I only see them once a year here in Brussels.

00:47.000 --> 00:49.000
Yeah, it's good to be back.

00:49.000 --> 00:53.000
Anyway, today I'm going to talk about this project called Docling.

00:53.000 --> 00:56.000
It's an open source project, obviously.

00:56.000 --> 01:03.000
And it's about processing, digesting, parsing documentation, keeping the meaning of it.

01:03.000 --> 01:10.000
And consistently, you know, like being able to kind of do that with different kind of documentation,

01:10.000 --> 01:12.000
document formats.

01:12.000 --> 01:22.000
And this "in a row", a really reliable, organized way, is kind of my take on it when I was coming up with the proposal for this talk.

01:22.000 --> 01:24.000
So my name is Carol Chen.

01:24.000 --> 01:26.000
I'm from Red Hat.

01:26.000 --> 01:28.000
This is an open source project.

01:28.000 --> 01:31.000
It's actually part of the Linux Foundation.

01:31.000 --> 01:35.000
And I'm just, you know, part of the community.

01:35.000 --> 01:36.000
I really like this project.

01:36.000 --> 01:39.000
And I would like to share that with people.

01:39.000 --> 01:42.000
Before I start, I would say I'm not a technical writer.

01:42.000 --> 01:45.000
I'm not a documentation subject matter expert.

01:45.000 --> 01:51.000
You know, like people like Daniel and other people in this room talking here.

01:51.000 --> 01:54.000
They have a lot more experience in a lot of these things.

01:54.000 --> 02:00.000
So apologies in advance if I get some terminology or concepts wrong.

02:00.000 --> 02:05.000
And feel free to give me feedback on, like, the Fediverse, Mastodon.

02:05.000 --> 02:08.000
I'm on Mastodon or Matrix.

02:08.000 --> 02:12.000
And you know, LinkedIn, if you want to, LinkedIn there.

02:12.000 --> 02:20.000
So you can also send me a PDF or write me feedback in a document format,

02:20.000 --> 02:23.000
a DOCX or HTML.

02:23.000 --> 02:24.000
Why?

02:24.000 --> 02:32.000
Because Docling can help me process, parse, and make sense of information from different document formats.

02:32.000 --> 02:35.000
So here's the Docling project.

02:35.000 --> 02:38.000
The key thing is it can parse different formats,

02:38.000 --> 02:40.000
Like I just mentioned.

02:40.000 --> 02:46.000
And one of the key things is the advanced parsing of PDFs.

02:46.000 --> 02:52.000
We talk a lot about, you know, the different ways to represent information and documentation.

02:52.000 --> 02:56.000
A lot of it is about how it looks visually.

02:56.000 --> 02:58.000
Like Markdown, markup,

02:58.000 --> 02:59.000
Whatever.

02:59.000 --> 03:01.000
You're looking at these headers.

03:01.000 --> 03:07.000
There's like paragraphs, sections, tables, diagrams and stuff like that.

03:07.000 --> 03:15.000
PDF is one of those formats that can encapsulate that and reproduce it reliably across different platforms.

03:15.000 --> 03:22.000
You know, I made a copy of this presentation in PDF just in case I couldn't present it on my own laptop.

03:22.000 --> 03:24.000
and had to use some other laptop.

03:24.000 --> 03:28.000
So a lot of times, there's a lot of information.

03:28.000 --> 03:33.000
But it's great for humans to read it, to digest it, to understand it.

03:33.000 --> 03:38.000
But machines, you know, that's why it's really probably quite challenging to parse PDFs.

03:39.000 --> 03:54.000
So Docling has this whole processing pipeline that makes use of certain AI models to be able to do a very thorough and accurate parsing of PDFs.

03:54.000 --> 04:01.000
But besides that, there's also like, since the project started, this was not like right away from the start.

04:01.000 --> 04:08.000
there's support for things like images and audio files with OCR and ASR.

04:08.000 --> 04:11.000
So there's different pipelines for the different file formats.

04:11.000 --> 04:13.000
So it's multimodal.

04:13.000 --> 04:22.000
And one of the key concepts for Docling is this unified, expressive DoclingDocument representation format,

04:22.000 --> 04:30.000
where it captures no matter what kind of input you give it, to be able to capture and preserve the meaning behind

04:30.000 --> 04:43.000
the document. So that, especially because when you want to translate it or convert it to some other format, that meaning of the document is not lost.

04:43.000 --> 04:52.000
A lot of times when you do like conversion or parsing, like it's one to one, maybe like PDF to HTML or PDF to mark down.

04:52.000 --> 04:58.000
But, you know, sometimes going through the process from one to the other, you lose some information data.

04:58.000 --> 05:10.000
So by converting that into, like, an intermediate unified format, it's able to avoid that, and you can then use that for outputting to other, different formats.

05:10.000 --> 05:16.000
So I'll get into a little bit of that more in the coming slides.

05:16.000 --> 05:19.000
Another key thing is the local execution.

05:19.000 --> 05:24.000
A lot of trans... I'll say, converters I've used myself,

05:24.000 --> 05:29.000
You know, you have to kind of upload the file somewhere to some cloud.

05:29.000 --> 05:37.000
And I guess it's fine if you're working on documentation for an open source project, most of the things are in the open.

05:37.000 --> 05:45.000
But sometimes you might just want to work with something that's a little bit more sensitive, or, you know, before a project release,

05:45.000 --> 05:51.000
there's something that you don't want things to kind of get leaked or anything like that.

05:51.000 --> 05:57.000
And, you know, can you trust all the different cloud services or SaaS out there, right?

05:57.000 --> 06:07.000
So, with Docling, you can download everything on your computer, on your laptop or on-prem, and be able to execute that locally.

06:07.000 --> 06:12.000
So, yeah, I'll get into some of these later on.

06:12.000 --> 06:14.000
And it's a Python package.

06:14.000 --> 06:18.000
So you can run, you know, just pip install docling.

06:18.000 --> 06:26.000
And there's also a CLI version, and of course you can also include it as a Python library in another app.
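
NOTE
Editor's sketch, not shown in the talk: a minimal example of the "Python library" route
mentioned above, assuming the DocumentConverter quick-start API from the Docling README
(installed with pip install docling). The helper name convert_to_markdown is hypothetical,
and model files are fetched on first use before everything can run locally.
```python
import sys
# Hypothetical helper name; DocumentConverter and export_to_markdown come from the docling package.
def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown via Docling's unified document format."""
    # Imported lazily so this file still loads before `pip install docling` has been run.
    from docling.document_converter import DocumentConverter
    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
# Only convert when a path or URL is actually given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(convert_to_markdown(sys.argv[1]))
```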

06:26.000 --> 06:28.000
So, very versatile.

06:28.000 --> 06:33.000
And let's look at more about the project.

06:33.000 --> 06:36.000
I'll try to help promote it.

06:36.000 --> 06:42.000
And it started, I think, near the end of 2024.

06:42.000 --> 06:48.000
So, more than a year ago, slightly more than a year ago, around, right here, you know,

06:48.000 --> 06:55.000
November '24, it started a big jump in, how to say, popularity.

06:55.000 --> 06:58.000
And it was like, for a while, trending on GitHub.

06:58.000 --> 07:02.000
And I know the Docling team would love it

07:02.000 --> 07:06.000
If I say this, please, you know, like and subscribe.

07:06.000 --> 07:10.000
I mean, like and follow the GitHub repos.

07:10.000 --> 07:14.000
And, you know, lots of popularity and lots of contributors.

07:14.000 --> 07:17.000
It's not just about using it, right?

07:17.000 --> 07:20.000
People find it useful and want to add stuff to it.

07:20.000 --> 07:25.000
So, we have a very active community as well behind this project.

07:25.000 --> 07:29.000
And, yeah, I was just looking at PyPI.

07:29.000 --> 07:32.000
The download numbers are also pretty amazing.

07:32.000 --> 07:37.000
So, take a look at github.com/docling-project.

07:37.000 --> 07:43.000
And there are a lot of, of course, repos under that organization.

07:43.000 --> 07:49.000
Now, let's take a step back and imagine that you are reading a paper,

07:49.000 --> 07:53.000
comparing the English language and the Finnish language.

07:53.000 --> 07:56.000
I don't know why you want to do that, but let's just go with that.

07:56.000 --> 08:06.000
So, we know that the English language has, sorry, the English alphabet has 26 letters, right?

08:06.000 --> 08:08.000
The Finnish has 29.

08:08.000 --> 08:14.000
So, if you, you know, know the language that's probably not very surprising,

08:14.000 --> 08:21.000
there's this idea of a pangram, which is like the shortest sentence you can form

08:21.000 --> 08:25.000
that contains all the letters in the alphabet.

08:25.000 --> 08:31.000
And I think also many of us know that, so the paper goes right.

08:31.000 --> 08:38.000
So, an example of a short sentence that contains all the 26 letters in the English alphabet

08:38.000 --> 08:46.000
is, quote, the quick brown fox jumps over page 6, horizontal line, footnote.

08:46.000 --> 08:52.000
Oh, here's a study about how climate change may impact the speed of brown foxes.

08:52.000 --> 08:57.000
Seven, the lazy dog, end quote.

08:57.000 --> 08:59.000
Wait a minute.

08:59.000 --> 09:05.000
That was quite a short sentence, and I'm pretty sure it contains all the 26 letters,

09:05.000 --> 09:09.000
but you won't quote me on that.

09:09.000 --> 09:12.000
You won't quote the paper like that, right?

09:12.000 --> 09:16.000
Of course, it's also a fake scenario, because there's no such paper yet.

09:16.000 --> 09:21.000
And if I ever get to write it, if I get my Finnish, you know, improved enough to write that,

09:21.000 --> 09:26.000
I will make sure to have that page break over the lazy dog.

09:26.000 --> 09:34.000
But there are actually real cases of stuff like that happening, you know, right?

09:34.000 --> 09:38.000
Now, this is actually a paper from the 50s.

09:38.000 --> 09:42.000
Some 1959 article, yeah, this is there.

09:42.000 --> 09:47.000
And it talks about, I have no idea what it talks about, but that's this part.

09:48.000 --> 09:52.000
It's actually in two columns, that says, vegetative, electron,

09:52.000 --> 09:55.000
microscopy, okay, whatever, right?

09:55.000 --> 10:02.000
But it's actually not, you know, in the same paragraph, just, you know, half a century later,

10:02.000 --> 10:08.000
a bunch of research scientific papers actually quote the study and say,

10:08.000 --> 10:14.000
oh, Fourier-transform infrared spectroscopy, vegetative,

10:14.000 --> 10:17.000
electron, microscopy, blah, blah, blah, blah, okay?

10:17.000 --> 10:19.000
Now, that's not the only one.

10:19.000 --> 10:24.000
If you search in, like, Google Scholar, that's like,

10:24.000 --> 10:27.000
cited in 112 papers.

10:27.000 --> 10:37.000
So, we can see how parsing wrong information from documents can have an adverse effect.

10:37.000 --> 10:41.000
That may be a bit more of an extreme case.

10:41.000 --> 10:46.000
And sure, you know, probably most parsers now can handle something like a two-column thing.

10:46.000 --> 10:49.000
Or maybe not, I don't know, because when you have images and graphs,

10:49.000 --> 10:54.000
it also can cause more confusion.

10:54.000 --> 11:02.000
When we use Docling on the same paper, we see that it accurately detects and

11:02.000 --> 11:05.000
understands the layout, the two columns.

11:05.000 --> 11:10.000
And, you know, parsed it correctly, like what happens to the vegetative cell wall,

11:10.000 --> 11:16.000
when the source release blah, blah, blah, and then, you know, effects by means of

11:16.000 --> 11:18.000
electron, microscopy, blah, blah, blah.

11:18.000 --> 11:23.000
So, again, having the right understanding of the PDF document,

11:23.000 --> 11:27.000
made a big difference.
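
NOTE
Editor's sketch, not from the talk: a toy illustration of why column-aware reading order
matters. The Box type, the coordinates, and the midpoint heuristic are all assumptions made
for this example; Docling itself uses a trained layout model rather than this rule.
```python
from typing import NamedTuple
class Box(NamedTuple):
    x: float   # left edge of the text block
    y: float   # top edge (smaller means higher on the page)
    text: str
def naive_order(boxes):
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.y, b.x))]
def column_order(boxes, page_width):
    """Read the whole left column first, then the whole right column."""
    mid = page_width / 2
    left = sorted((b for b in boxes if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in boxes if b.x >= mid), key=lambda b: b.y)
    return [b.text for b in left + right]
# Toy two-column page: each column is a separate sentence fragment.
page = [
    Box(50, 100, "of the vegetative"), Box(320, 100, "electron microscopy"),
    Box(50, 120, "cell wall"), Box(320, 120, "shows"),
]
print(" ".join(naive_order(page)))                    # rows merged across the column gap
print(" ".join(column_order(page, page_width=600)))   # columns kept intact
```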

11:27.000 --> 11:35.000
It's one thing to get kind of the meaning wrong; it's another to just kind of lose content all

11:35.000 --> 11:36.000
together.

11:36.000 --> 11:43.000
A lot of, like, basic parsers probably, you know, like the quick brown fox case,

11:43.000 --> 11:49.000
you know, put the header and the footer and all that fun information together

11:49.000 --> 11:55.000
with the content, I mean the main body of the paper.

11:55.000 --> 12:02.000
So, the meaning is kind of included or kind of diluted by this kind of undesired

12:02.000 --> 12:09.000
page header, headers, tables become like a list of numbers that has no relation

12:09.000 --> 12:10.000
to each other.

12:10.000 --> 12:14.000
You don't understand what this list of numbers mean.

12:14.000 --> 12:17.000
The whole image is missing and line wraps.

12:17.000 --> 12:23.000
It's just like chunks of lines, broken and there's no wrap.

12:23.000 --> 12:29.000
So, you know, like when you, again, like parse those lines, you lose that context

12:29.000 --> 12:36.000
and you may like, you know, break them out in the wrong places and stuff like that.

12:36.000 --> 12:39.000
So, Docling to the rescue.

12:39.000 --> 12:40.000
Markdown.

12:40.000 --> 12:41.000
I love Markdown.

12:41.000 --> 12:48.000
It's one of those formats that's easily understood whether it's like, yeah,

12:48.000 --> 12:52.000
actually formatting thing, the kind of the raw part as well as, of course,

12:52.000 --> 12:57.000
you can then make very kind of clear and structured output.

12:57.000 --> 13:04.000
So, in that sense, Markdown is usually not the problematic one, right?

13:04.000 --> 13:09.000
But, again, like I was saying, PDFs tend to be one of those formats that,

13:09.000 --> 13:15.000
even though it's not easy to kind of create or whatever,

13:15.000 --> 13:23.000
but it does have to capture that visual representation with diagrams and tables and stuff like that.

13:23.000 --> 13:28.000
But then once you have that to kind of go reverse, it's really difficult.

13:28.000 --> 13:36.000
So, when Docling first started, they created this PDF pipeline to kind of address that more specifically.

13:36.000 --> 13:40.000
Like Markdown, HTML, those are a little bit easier.

13:40.000 --> 13:47.000
They have kind of a programmatic pipeline and a bit more straightforward parsing,

13:47.000 --> 13:48.000
but PDFs.

13:48.000 --> 13:53.000
First, you have to use an OCR to kind of scan a whole page.

13:53.000 --> 13:58.000
Find out where's the text, where are the pictures, information, whatever.

13:58.000 --> 14:05.000
And there are actually two key parts of the process here. Layout analysis:

14:05.000 --> 14:08.000
So, again, like the column flow, right?

14:08.000 --> 14:10.000
Is it two columns, three columns?

14:10.000 --> 14:15.000
You know, when you have a line under that diagram,

14:15.000 --> 14:20.000
is that part of the paragraph below, or the image above, that kind of thing.

14:20.000 --> 14:25.000
The table structure: tables are simple, yet not really,

14:25.000 --> 14:30.000
because you can have, like, multi columns that, you know,

14:30.000 --> 14:33.000
spans across multiple columns and rows.

14:33.000 --> 14:36.000
You can have tables within the table and so on.

14:36.000 --> 14:46.000
So, there are actually two small models that they use to kind of do this job specifically.

14:46.000 --> 14:53.000
Like, instead of throwing the document into a large LLM, you know,

14:53.000 --> 14:55.000
2B, 7B, whatever, right?

14:55.000 --> 15:00.000
Sure, they can probably do a very decent kind of general,

15:00.000 --> 15:05.000
parsing of that, but maybe it's like 70-80% accurate.

15:05.000 --> 15:09.000
For maybe some simple PDFs, that's fine,

15:09.000 --> 15:12.000
and, you know, you're happy with the results,

15:12.000 --> 15:17.000
but for more kind of specific use cases, you want models

15:17.000 --> 15:20.000
that are designed to do that specific task.

15:20.000 --> 15:24.000
So, these are that kind of model, small, tiny models,

15:24.000 --> 15:29.000
if I remember correctly, like 40, 42 million, rather than billion,

15:29.000 --> 15:34.000
you know, so it downloads onto your laptop, you can process everything locally,

15:34.000 --> 15:38.000
like I said, and because they are trained to just do,

15:38.000 --> 15:44.000
one to just do layout analysis and the other just to do table structure,

15:44.000 --> 15:47.000
So, they are very good at that task.

15:47.000 --> 15:50.000
They're good at nothing else, but those two, those specific tasks,

15:50.000 --> 15:53.000
and be able to do that.

15:54.000 --> 15:58.000
Later on, the team also developed this VLM pipeline,

15:58.000 --> 16:03.000
which is a slightly bigger model, like 258 million parameters,

16:03.000 --> 16:08.000
but still small compared to most LLMs or foundation models out there.

16:08.000 --> 16:14.000
So, this VLM pipeline takes it a step further and does all this;

16:14.000 --> 16:18.000
it is designed to do all these steps in one go,

16:18.000 --> 16:21.000
so you don't have to kind of go through the whole thing,

16:21.000 --> 16:29.000
and be more efficient in the understanding and the parsing of the PDF document.

16:29.000 --> 16:35.000
It does take more resources, but it also produces more accuracy,

16:35.000 --> 16:39.000
more accurate results, and you know, if you can get something back,

16:39.000 --> 16:44.000
that's like 92, 95% accurate, sure, it's never 100%,

16:44.000 --> 16:49.000
but you spend a lot less time; we always want to double-check the work,

16:49.000 --> 16:53.000
and we don't 100% trust the AI model.

16:53.000 --> 16:57.000
But with a model like this, you can save a lot of time

16:57.000 --> 17:02.000
in that kind of follow-up checking work compared to general models.

17:02.000 --> 17:07.000
So, again, this is the kind of the intermediate document format

17:07.000 --> 17:12.000
that I talked about, and then, which then can be exported to different,

17:12.000 --> 17:14.000
oh, my goodness, I'm taking too much time.

17:14.000 --> 17:21.000
Okay, so when I first heard about this Docling document format,

17:21.000 --> 17:24.000
I was like, you know, this is like, yeah, another one.

17:24.000 --> 17:27.000
But the key thing is, like I said, you know, you want something

17:27.000 --> 17:32.000
that is able to give you an almost lossless representation of that information, that data,

17:32.000 --> 17:34.000
so that you can be able to parse it out to,

17:34.000 --> 17:37.000
Markdown, to [inaudible], and different ones,

17:37.000 --> 17:42.000
and one of the interesting things that

17:42.000 --> 17:46.000
Docling wanted to solve was parsing for

17:46.000 --> 17:51.000
AI models to train, to fine-tune, for RAG, and stuff like that.

17:51.000 --> 17:54.000
So, I'll talk about chunking a little bit.

17:54.000 --> 17:56.000
So, these are the input formats.

17:56.000 --> 17:58.000
I think the list is growing all the time,

17:58.000 --> 18:02.000
and with every release something new is supported, as well as the output formats.

18:02.000 --> 18:03.000
And there's a thing, right?

18:03.000 --> 18:07.000
It's an open source project, and it's able,

18:07.000 --> 18:09.000
if there's some formats that you care about,

18:10.000 --> 18:15.000
the ability, because of the standard document format,

18:15.000 --> 18:19.000
you can then, you know, write your own kind of input for two or

18:19.000 --> 18:21.000
that, and then also output.

18:21.000 --> 18:26.000
So, the updated list is probably on this link,

18:26.000 --> 18:30.000
but this was when I grabbed it like maybe a week or two ago.

18:30.000 --> 18:36.000
So, there are just a lot of technical details about the document format.

18:36.000 --> 18:41.000
Again, you can see from the link, but basically it expresses

18:41.000 --> 18:48.000
a lot of the things that a lot of times it's not easy to capture

18:48.000 --> 18:54.000
with just simple, like, how to say.

18:54.000 --> 18:57.000
Like, you preserve that, okay, we know this is a text,

18:57.000 --> 18:59.000
we know this is a table, and how it relates to things.

18:59.000 --> 19:01.000
So, and also the hierarchy,

19:01.000 --> 19:05.000
we know H1 is under, sorry, H2 is under H1,

19:05.000 --> 19:10.000
and so on. So, a lot of parsing may not be able to capture that.

19:10.000 --> 19:15.000
So, it does that, and then, again, like I said,

19:15.000 --> 19:20.000
the APIs to build the document from scratch with different formats.

19:20.000 --> 19:24.000
Like I said, it has a CLI, you can use that,

19:24.000 --> 19:26.000
but I also wanted to introduce this Duckling project,

19:26.000 --> 19:31.000
actually done by somebody in my team, in Red Hat,

19:32.000 --> 19:34.000
because we were playing around with Docling,

19:34.000 --> 19:39.000
and I was like, it would be nice to have like a simple UI,

19:39.000 --> 19:42.000
GUI interface that we can, you know, try things out,

19:42.000 --> 19:46.000
instead of trying to check all the different CLI parameters

19:46.000 --> 19:47.000
and configurations.

19:47.000 --> 19:51.000
So, if you go to duckling-ui.org,

19:51.000 --> 19:55.000
there is the, you can find how to use this.

19:55.000 --> 19:59.000
And I kept this nonsensical thing here,

19:59.000 --> 20:03.000
because I fell asleep at 2am, updating the slides.

20:03.000 --> 20:07.000
And I don't know if it was my hand or my face was on the laptop,

20:07.000 --> 20:10.000
and I thought, well, this is an artifact for FOSDEM.

20:10.000 --> 20:14.000
I'll feed this to Duckling and see what it says about that.

20:14.000 --> 20:16.000
But, that's for next time.

20:16.000 --> 20:20.000
I was just chatting with my teammate yesterday.

20:20.000 --> 20:23.000
He said there's the latest release, which, of course, I had to install,

20:23.000 --> 20:27.000
and it supports internationalization.

20:27.000 --> 20:30.000
So, you can see, like, now it has this,

20:30.000 --> 20:33.000
I think German and French, yeah.

20:33.000 --> 20:35.000
So, and I think a couple of others.

20:35.000 --> 20:38.000
So, let's see, we can just run the,

20:38.000 --> 20:42.000
actually, I already ran the demo,

20:42.000 --> 20:44.000
because it takes a couple of minutes.

20:44.000 --> 20:47.000
So, I wasn't going to, like, let it run,

20:47.000 --> 20:52.000
but it is live running on my laptop.

20:52.000 --> 20:55.000
And I also tested this,

20:56.000 --> 20:58.000
without... right now, it's connected to the FOSDEM Wi-Fi,

20:58.000 --> 21:01.000
but I tested it without, just in case it wasn't working.

21:01.000 --> 21:04.000
And again, just to prove that it works locally,

21:04.000 --> 21:07.000
you don't need an internet connection after you download all the models.

21:07.000 --> 21:10.000
So, you can perform everything on your machine.

21:10.000 --> 21:13.000
So, this converted, I had, like,

21:13.000 --> 21:18.000
some free Ansible ebook that I just downloaded,

21:18.000 --> 21:22.000
in PDF format, extracted images from it,

21:23.000 --> 21:25.000
understood all the tables.

21:25.000 --> 21:29.000
That was in the PDF and chunking.

21:29.000 --> 21:30.000
Right.

21:30.000 --> 21:35.000
Most AI models, when they take input to, you know,

21:35.000 --> 21:37.000
do RAG or do whatever,

21:37.000 --> 21:40.000
there's the context window and, you know,

21:40.000 --> 21:42.000
it's a limited number of tokens.

21:42.000 --> 21:44.000
There's kind of the most primitive chunking,

21:44.000 --> 21:46.000
we'll just say, okay, 500 tokens.

21:46.000 --> 21:48.000
I'm just going to count 500 tokens,

21:49.000 --> 21:51.000
but then you lose meaning,

21:51.000 --> 21:53.000
and you lose context when you do that.

21:53.000 --> 21:55.000
So, here you can see,

21:55.000 --> 21:57.000
it actually did a hundred-something chunks,

21:57.000 --> 21:59.000
and each one is like, I want to show,

21:59.000 --> 22:01.000
let me be this here,

22:01.000 --> 22:04.000
like, you know, heading.

22:04.000 --> 22:07.000
It's clear, like, this heading is abstract.

22:07.000 --> 22:10.000
This one is one introduction to getting started.

22:10.000 --> 22:12.000
It's able to,

22:12.000 --> 22:13.000
and because,

22:13.000 --> 22:15.000
Docling understands the format,

22:16.000 --> 22:18.000
it creates blocks of information.

22:18.000 --> 22:20.000
So, it understands that and it's able to

22:20.000 --> 22:22.000
group that as a chunk.

22:22.000 --> 22:24.000
And if it's like,

22:24.000 --> 22:30.000
within the context window limit,

22:30.000 --> 22:31.000
it will just use that.

22:31.000 --> 22:34.000
If not, it'll find an appropriate way to,

22:34.000 --> 22:37.000
kind of chunk that into smaller blocks,

22:37.000 --> 22:39.000
but still be able to kind of keep that,

22:39.000 --> 22:41.000
semantic information.
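
NOTE
Editor's sketch, not from the talk: a simplified contrast between fixed-size chunking and
structure-aware chunking. Whitespace-separated words stand in for model tokens, Markdown
headings stand in for the document structure, and both function names are hypothetical;
this is not Docling's own chunker.
```python
def fixed_chunks(text, size):
    """Primitive chunking: cut every `size` whitespace tokens, ignoring structure."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
def heading_chunks(markdown, max_tokens):
    """Group lines under their heading; split a section only if it exceeds max_tokens."""
    sections, heading, body = [], "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if body:
                sections.append((heading, body))
            heading, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((heading, body))
    chunks = []
    for heading, body in sections:
        tokens = " ".join(body).split()
        for i in range(0, len(tokens), max_tokens):
            # Every piece keeps its heading so the context travels with it.
            chunks.append(f"{heading}: {' '.join(tokens[i:i + max_tokens])}")
    return chunks
doc = "# Abstract\nDocling parses documents.\n# Introduction\nPDFs are hard to parse reliably."
print(fixed_chunks(doc, 4))     # sections bleed into each other
print(heading_chunks(doc, 10))  # one chunk per section, labelled with its heading
print(heading_chunks(doc, 3))   # an oversized section is split but keeps its heading
```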

22:41.000 --> 22:44.000
And then I just, again, want to quickly show,

22:44.000 --> 22:46.000
like,

22:46.000 --> 22:47.000
Docling versus,

22:47.000 --> 22:48.000
this is Docling.

22:48.000 --> 22:49.000
It's able to, like, you know,

22:49.000 --> 22:52.000
parse the index with page numbers accurately.

22:52.000 --> 22:53.000
This was,

22:53.000 --> 22:54.000
I can't remember, I used

22:54.000 --> 22:56.000
MarkItDown or some other parser.

22:56.000 --> 22:57.000
It's like, you know,

22:57.000 --> 22:58.000
block,

22:58.000 --> 23:00.000
numbers, block, numbers.

23:00.000 --> 23:01.000
What's chapter one?

23:01.000 --> 23:02.000
I have no idea.

23:02.000 --> 23:03.000
Here, you can see,

23:03.000 --> 23:05.000
about chapter one,

23:05.000 --> 23:07.000
it keeps all the, you know,

23:07.000 --> 23:08.000
heading and information,

23:08.000 --> 23:10.000
and be able to output that accordingly.

23:10.000 --> 23:12.000
So,

23:12.000 --> 23:13.000
again, this is,

23:13.000 --> 23:14.000
this was from,

23:14.000 --> 23:16.000
I just used the,

23:16.000 --> 23:18.000
the Duckling UI to be able to do that.

23:18.000 --> 23:19.000
So, making it easy for you.

23:19.000 --> 23:21.000
You don't have to remember,

23:21.000 --> 23:23.000
the, you know,

23:23.000 --> 23:25.000
all these CLI references.

23:25.000 --> 23:28.000
You can access that from here,

23:28.000 --> 23:29.000
right from the setting.

23:29.000 --> 23:30.000
You can enable,

23:30.000 --> 23:31.000
OCR.

23:31.000 --> 23:32.000
You can select one.

23:32.000 --> 23:34.000
You can even switch OCR,

23:34.000 --> 23:35.000
and if it's not installed,

23:35.000 --> 23:36.000
it will install it automatically,

23:36.000 --> 23:37.000
except for Tesseract,

23:37.000 --> 23:39.000
which needs a system-wide install,

23:39.000 --> 23:40.000
so you can do that.

23:41.000 --> 23:43.000
Then add options,

23:43.000 --> 23:44.000
which, again,

23:44.000 --> 23:45.000
you don't have to remember all the,

23:45.000 --> 23:46.000
all the different parameters.

23:46.000 --> 23:47.000
You can just,

23:47.000 --> 23:49.000
turn them on and off here.

23:49.000 --> 23:51.000
And then there's even documentation,

23:51.000 --> 23:52.000
built right in,

23:52.000 --> 23:53.000
MkDocs,

23:53.000 --> 23:54.000
perfect markdown,

23:54.000 --> 23:55.000
very easy.

23:55.000 --> 23:57.000
And I've one minute left.

23:57.000 --> 23:58.000
So,

23:58.000 --> 24:01.000
let's see what's my slides.

24:01.000 --> 24:03.000
So, feel free to,

24:03.000 --> 24:05.000
check it out about chunking.

24:05.000 --> 24:06.000
As I clearly,

24:06.000 --> 24:07.000
like I said,

24:07.000 --> 24:08.000
it's about, you know,

24:08.000 --> 24:09.000
splitting it in the right way,

24:09.000 --> 24:11.000
it preserves meaning and context,

24:11.000 --> 24:12.000
and,

24:12.000 --> 24:14.000
be able to understand layout

24:14.000 --> 24:17.000
and reduce hallucination for the models.

24:17.000 --> 24:18.000
All right.

24:18.000 --> 24:19.000
So,

24:19.000 --> 24:21.000
I will upload all these slides,

24:21.000 --> 24:24.000
maybe I'll remove that chunk of nonsense letters,

24:24.000 --> 24:26.000
but I'll upload this to Pretalx,

24:26.000 --> 24:27.000
so you'll be able to see,

24:27.000 --> 24:30.000
get all these links for both Docling and Duckling.

24:30.000 --> 24:32.000
And that's it.

24:32.000 --> 24:33.000
The Docling team,

24:33.000 --> 24:36.000
by the way, is from IBM Research in Zurich.

24:36.000 --> 24:38.000
You know, this is not the full list.

24:38.000 --> 24:40.000
We have them with, you know,

24:40.000 --> 24:42.000
of course, a lot of these images

24:42.000 --> 24:46.000
and the Docling logo are from them,

24:46.000 --> 24:50.000
but the Duckling author is from Red Hat, David.

24:50.000 --> 24:51.000
Hi, David.

24:51.000 --> 24:53.000
Thank you so much for all your help.

24:53.000 --> 24:54.000
If you're watching this,

24:54.000 --> 24:56.000
amazing stuff.

24:56.000 --> 24:57.000
So, thank you.

24:57.000 --> 24:58.000
If you have any questions,

24:58.000 --> 24:59.000
like I said,

24:59.000 --> 25:01.000
feel free to look for me outside.

25:01.000 --> 25:03.000
And that's my time.

25:03.000 --> 25:05.000
Thank you very much.

25:05.000 --> 25:08.000
Thank you.

25:08.000 --> 25:11.000
Thank you.

