WEBVTT

00:00.000 --> 00:07.000
Okay, hi everyone.

00:07.000 --> 00:18.000
Well, this is now the moment of this open research seminar, and thank you for being here.

00:18.000 --> 00:23.000
But first, let me introduce myself and my team.

00:24.000 --> 00:27.000
Well, actually, I'm not alone on this slide.

00:27.000 --> 00:34.000
Because I didn't want to take credit for a tool that I didn't develop myself or at least, well,

00:34.000 --> 00:37.000
I'm not the original contributor, I'd say.

00:37.000 --> 00:43.000
Because the main developer of xan is my dear colleague, Guillaume Plique.

00:43.000 --> 00:48.000
And sadly, he couldn't be here today, so instead you have me.

00:48.000 --> 00:55.000
On this slide, there are a lot of research engineers.

00:55.000 --> 01:04.000
And you also have our manager, who is among the most active and also a very enthusiastic user of xan.

01:04.000 --> 01:11.000
And we all work as research engineers in the médialab, which is a sociology lab.

01:11.000 --> 01:19.000
And you'll see later why this is important, because actually xan was developed in this specific context.

01:19.000 --> 01:21.000
So, what is xan?

01:21.000 --> 01:31.000
It's actually a command-line tool that you can use, so, in your terminal, to manipulate CSV data.

01:31.000 --> 01:41.000
And the idea is not ours, because xan has actually been forked from another project called xsv.

01:41.000 --> 01:43.000
That is not maintained anymore.

01:43.000 --> 01:48.000
And it was created in Rust by BurntSushi.

01:48.000 --> 01:55.000
If you don't live in the Rust world, know that BurntSushi developed maybe half of the current Rust ecosystem.

01:55.000 --> 01:58.000
So, he's really a celebrity.

01:58.000 --> 02:03.000
He apparently decided that xsv was not his priority anymore.

02:03.000 --> 02:13.000
And so now, xan has been completely rewritten from that baseline to fit our needs.

02:13.000 --> 02:17.000
So, why do we use xan?

02:17.000 --> 02:21.000
And this is going to be the outline of my presentation.

02:21.000 --> 02:25.000
We use it because, first, we love the CSV format.

02:25.000 --> 02:28.000
Second, because we love the terminal, of course.

02:28.000 --> 02:31.000
Third,

02:31.000 --> 02:39.000
because xan is easy to use. And fourth, because xan is fast.

02:39.000 --> 02:42.000
So, why do we love the CSV format?

02:42.000 --> 02:48.000
I underlined the 'we', because apparently this is an unpopular opinion.

02:49.000 --> 02:53.000
And well, there are many reasons why we should love the CSV format.

02:53.000 --> 02:58.000
Actually, we even wrote a love letter to the CSV format that you can find online.

02:58.000 --> 03:07.000
And it's an unpopular opinion, because, well, this love letter started a flame war on Hacker News.

03:07.000 --> 03:14.000
With hundreds of people telling how much they hate the CSV format.

03:14.000 --> 03:19.000
So, well, let me focus maybe on one point why we love it.

03:19.000 --> 03:23.000
It's mainly because, actually, we don't work with engineers.

03:23.000 --> 03:25.000
We work with social scientists.

03:25.000 --> 03:27.000
And these are normal people.

03:27.000 --> 03:29.000
They like.

03:29.000 --> 03:35.000
They like reading text and being able to edit text in a textual format.

03:35.000 --> 03:42.000
And since CSV is that simple and CSV is text, this is the first reason.

03:43.000 --> 03:47.000
The second reason is because social scientists are used to tabular data.

03:47.000 --> 03:52.000
Well, I have to include myself in the lot after all these years working with those people.

03:52.000 --> 03:57.000
So, we as social scientists, we often think in terms of variables, so columns,

03:57.000 --> 04:02.000
spread across a population of individuals, so rows.

04:02.000 --> 04:09.000
And well, we actually have lots of different tools dealing with tabular data.

04:09.000 --> 04:13.000
So, Excel, R, or pandas, you name it.

04:13.000 --> 04:20.000
And so, CSV is actually the format to interoperate between this variety of tools.

04:20.000 --> 04:25.000
So, why do we love the terminal?

04:25.000 --> 04:36.000
Well, maybe, maybe you could say we love it because we are old.

04:36.000 --> 04:40.000
But, let me focus on a few reasons.

04:40.000 --> 04:44.000
We love it because we do repetitive tasks.

04:44.000 --> 04:48.000
And so, the terminal is actually a shortcut.

04:48.000 --> 04:51.000
And well, thank you Paul for sending the video.

04:51.000 --> 04:53.000
It was perfect.

04:53.000 --> 04:58.000
Second, because it's actually an archive of our previous activities.

04:58.000 --> 05:05.000
And third, because the terminal doesn't change much, or at least hasn't changed much in the last decades.

05:05.000 --> 05:10.000
So, actually, we can continue doing our repetitive tasks always the same way.

05:10.000 --> 05:13.000
And we like it.

05:13.000 --> 05:18.000
So, now that you understand why we love the CSV format and why we use the terminal,

05:18.000 --> 05:30.000
let me explain why you should, and why we actually do, manipulate our CSV data in the terminal with xan.

05:30.000 --> 05:35.000
And we do it because xan is really easy to use.

05:35.000 --> 05:40.000
And well, first reason why xan is easy to use.

05:40.000 --> 05:49.000
It's actually because you don't need to know a lot of xan to get started, and actually to adopt it very fast.

05:49.000 --> 05:57.000
So, actually, you just need to know maybe two, three, four xan commands to actually benefit from it.

05:57.000 --> 06:06.000
And maybe you will have the rest of your data management pipeline in another coding language, maybe in Python, maybe in R, who knows.

06:06.000 --> 06:11.000
So, just dive into the doc with the quick tour, and test it.

06:11.000 --> 06:16.000
And you will probably adopt it, maybe just for those quick use cases.

06:16.000 --> 06:18.000
So, what commands are the most useful?

06:18.000 --> 06:28.000
To answer that question, I asked five colleagues to search their shell history for xan commands and send me the result.

06:28.000 --> 06:38.000
So, let me present this highly representative survey.

06:38.000 --> 06:46.000
Well, the most important command, of course, is xan view, alias xan v.

06:46.000 --> 06:53.000
And it's actually really the only command that you should know, if you were to know only one.

06:53.000 --> 07:02.000
It simply shows a tabular file in a nice format, of course, with colors depending on the type, et cetera.

07:03.000 --> 07:12.000
But also, it's the last command that you will use at the end of your xan pipeline, to show the results of what you did.
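
NOTE
A minimal sketch of both uses of xan view just described, with a hypothetical tweets.csv file; the second line assumes xan search takes the pattern before the file:
xan view tweets.csv
xan search Paris tweets.csv | xan view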

07:12.000 --> 07:23.000
Then, there is xan headers, or xan h, and it shows, well, the headers of your file.

07:23.000 --> 07:29.000
So, the syntax here is also simple: xan headers my_file.

07:29.000 --> 07:36.000
And then, well, if there are duplicates among the column names, they will be shown in red.
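
NOTE
As just said, the syntax is simply the command followed by the file; tweets.csv is a hypothetical name, and xan h is the alias:
xan headers tweets.csv
xan h tweets.csv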

07:36.000 --> 07:44.000
Actually, I was surprised by this one, xan map, being the third one, because it's not the easiest.

07:44.000 --> 07:51.000
Because, actually, xan map applies the same operation to all rows in your CSV file.

07:52.000 --> 07:59.000
But, of course, you then have to know a bit about the syntax of the operations that you can apply.
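
NOTE
For the curious, a hedged sketch of such an operation, assuming hypothetical columns retweet_count and like_count, and assuming xan map takes the expression first and the new column name second (check xan map --help):
xan map 'retweet_count + like_count' engagement tweets.csv | xan view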

07:59.000 --> 08:01.000
So, maybe it's not for today.

08:01.000 --> 08:08.000
Maybe today, you can focus on xan search, which is really easy to use.

08:08.000 --> 08:12.000
You just do xan search, and then a phrase.

08:12.000 --> 08:17.000
Of course, you can also use regexes, and apply it on your file.

08:17.000 --> 08:20.000
And then, you can use xan view to visualize the results.

08:20.000 --> 08:23.000
Here, I selected one specific column.

08:23.000 --> 08:27.000
And you get all the lines that contain your search.
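
NOTE
A sketch of the search just described, with hypothetical file and column names; the -s column-selection flag is inherited from xsv, but worth checking against xan search --help:
xan search -s text parliament tweets.csv | xan view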

08:27.000 --> 08:39.000
And the last one for today, but, of course, not the last xan command: it's xan frequency, which apparently everyone calls xan freq.

08:39.000 --> 08:45.000
And it computes the frequency of each modality in a column.

08:45.000 --> 08:56.000
So, for example, here, apparently most lines were created in October 2022.
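
NOTE
A sketch of that frequency computation, assuming a hypothetical created_at column and the xsv-style -s selection flag:
xan freq -s created_at tweets.csv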

08:56.000 --> 09:06.000
The second reason why xan is really easy to use is, actually, because it takes CSV in, and it gives CSV out.

09:06.000 --> 09:11.000
So, of course, you can write the results of your pipeline in a new CSV file.

09:11.000 --> 09:14.000
And also, it means that xan is very modular.

09:14.000 --> 09:18.000
Since you can pipe commands into one another.
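
NOTE
A sketch of this modularity, with hypothetical names: each command reads CSV on stdin and writes CSV on stdout, so plain shell pipes and redirection are all you need.
xan search -s text parliament tweets.csv | xan freq -s user > result.csv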

09:18.000 --> 09:29.000
And here, we have the perfect example of a user who wrote this on our GitHub three days ago, I think.

09:29.000 --> 09:33.000
Well, I had to look up the meaning of ECDF.

09:33.000 --> 09:38.000
It's the empirical cumulative distribution function.

09:38.000 --> 09:43.000
And, well, that's his use case, and he's using xan to do it.

09:43.000 --> 09:51.000
So, he's piping xan progress into xan map, then doing xan groupby, et cetera, et cetera, and ending with xan plot.

09:51.000 --> 09:56.000
And, actually, don't make fun of this user's title.

09:56.000 --> 10:07.000
Because you might end up doing the same, or much worse, because actually, that's the way you program with xan.

10:07.000 --> 10:17.000
Well, also, xan is easy to use, because it's full of useful tricks.

10:17.000 --> 10:27.000
And, well, I mean, I will name a few of them that I use a lot, but I probably forget a lot of them.

10:27.000 --> 10:36.000
The one I really like is the ability to use the wildcard operator when you select columns, which means that you don't have to type the entire name,

10:36.000 --> 10:45.000
of your columns. So, here, for example, I select all columns that contain 'site' in my file,

10:45.000 --> 10:58.000
but it also means that if I don't want to type 'site_address', I could write only 'site_a' plus the wildcard and get my column.
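
NOTE
A sketch of the wildcard selection just described, assuming a hypothetical tweets.csv with a site_address column:
xan select 'site_a*' tweets.csv | xan view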

10:58.000 --> 11:11.000
Another useful trick is xan hist, which plots a histogram; but if you use it with temporal data, you can use the dates parameter,

11:11.000 --> 11:22.000
and it will actually add empty rows for missing dates in the data, so that you can have a correct temporal histogram.
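
NOTE
A hedged sketch: count rows per date, then plot; the spelling of the dates flag is an assumption taken from the talk, so check xan hist --help:
xan freq -s created_at tweets.csv | xan hist --dates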

11:23.000 --> 11:35.000
Another one is the ability to highlight some phrase when you've flattened the data: so here you have your data flattened with xan flatten,

11:35.000 --> 11:44.000
and you use the highlight parameter to highlight the part of your data that contains this phrase.
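
NOTE
A sketch of that flattened, highlighted view; the highlight flag's exact spelling is an assumption taken from the talk (check xan flatten --help):
xan search -s text parliament tweets.csv | xan flatten --highlight parliament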

11:44.000 --> 11:53.000
And a very quick one: xan parallel cat lets you concatenate several CSV files together.

11:53.000 --> 12:09.000
And since it does it in a parallel manner, you can preprocess each file, so this step is done not after concatenating the data, but before.

12:09.000 --> 12:25.000
So here, for example, you have a bunch of CSV data from 2022, so we apply xan parallel cat to those CSV files, preprocessing each one by searching for 'parliament',

12:25.000 --> 12:33.000
and then, of course, highlighting 'parliament' in the result, so that we get this data.
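
NOTE
A sketch of the concatenation step, with hypothetical 2022-*.csv shards; the per-file 'parliament' search is attached through a preprocessing option whose exact flag I won't guess here (see xan parallel --help):
xan parallel cat 2022-*.csv | xan view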

12:34.000 --> 12:41.000
Then, of course, let's be honest: the main reason why we use xan is because it is fast.

12:41.000 --> 12:49.000
So, how fast is xan? That's the question I will try to answer with a quick demo.

12:49.000 --> 13:01.000
So, actually, I created three scripts: one that computes a frequency using the Python standard library,

13:01.000 --> 13:08.000
one that computes it using pandas, and one that computes it using Polars.

13:08.000 --> 13:23.000
So, if you are new to that ecosystem: pandas is actually using C to accelerate the process, and Polars is actually using Rust to accelerate it.
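
NOTE
A sketch of how the comparison could be reproduced, with hypothetical script, file, and column names, timing each approach on the same task:
time python frequency_stdlib.py many_tweets.csv user
time python frequency_pandas.py many_tweets.csv user
time python frequency_polars.py many_tweets.csv user
time xan freq -s user many_tweets.csv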

13:23.000 --> 13:32.000
So, you would expect that the Python standard library is the slowest; we will see that it's not always the case.

13:32.000 --> 13:43.000
Pandas should be a bit faster, Polars a bit faster still, and xan, of course, the fastest, but let's see that.

13:44.000 --> 13:55.000
So, here, you have some CSV files: few_tweets and many_tweets are the two files we will use.

13:55.000 --> 14:01.000
So, let's apply a frequency on the few_tweets file.

14:02.000 --> 14:14.000
xan freq, on the user column… xan freq.

14:14.000 --> 14:23.000
Okay, so this is the kind of data that we get: these are the most retweeted users in my dataset, the small dataset.

14:23.000 --> 14:32.000
Now, let's benchmark it on the big dataset.

14:37.000 --> 14:47.000
Well, actually, you have to know that the many_tweets dataset weighs 90 gigabytes.

14:48.000 --> 14:52.000
And so, it does it in 9 seconds.

14:55.000 --> 14:59.000
Okay, then, let's do the same in Python.

15:00.000 --> 15:04.000
Well, do we really have the time for the...

15:06.000 --> 15:09.000
Okay, well, let's kill the two of them.

15:10.000 --> 15:15.000
Well, pandas is slow because it has to load everything in RAM.

15:16.000 --> 15:20.000
So, it takes actually the same time as the Python standard library.

15:21.000 --> 15:23.000
Well, let's see, let's see Polars, which is written in Rust.

15:24.000 --> 15:26.000
So, the same language as xan.

15:28.000 --> 15:36.000
And well, actually, let's use htop to also see how Polars uses the RAM.

15:37.000 --> 15:41.000
So, Python, frequency, Polars.

15:42.000 --> 15:45.000
And then I'm giving it the many_tweets file.

15:46.000 --> 16:29.000
[Live-demo hiccup: the command fails and is restarted, to laughter in the room.]

16:30.000 --> 16:48.440
Okay, so, but I don't know if I can feed Polars the many_tweets file in /live/…

16:49.440 --> 16:56.440
And let's start it.

16:59.440 --> 17:03.440
Three seconds. So you will tell me: okay, Polars is actually faster than xan.

17:04.440 --> 17:11.440
This is true, but it's also because if I start it again, you will see that it's actually using all

17:11.440 --> 17:14.440
the processors of my computer.

17:17.440 --> 17:20.440
And actually, I can do the same with xan.

17:21.440 --> 17:30.440
So if I do xan freq here, but I use the dash p flag, then I run it in a parallelized manner,

17:31.440 --> 17:33.440
and then it only takes one second.

17:34.440 --> 17:38.440
So, actually, using all the processors of my computer, xan is actually faster than Polars.
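
NOTE
A sketch of the parallel run just shown; the -p flag is quoted from the talk, so check xan freq --help for its exact name and semantics:
time xan freq -s user -p many_tweets.csv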

17:39.440 --> 17:43.440
I think this will be the end of my presentation.

17:44.440 --> 17:48.440
Thank you very much for listening, and I'm open for questions.

17:48.440 --> 17:58.440
That was fast. Do we have any questions?

18:05.440 --> 18:07.440
Yep, one.

18:18.440 --> 18:24.440
There are loads of tools, DuckDB for instance.

18:25.440 --> 18:35.440
I'm just curious how it would compare to xan in terms of speed and so on, even if the interface is going to be different,

18:36.440 --> 18:38.440
because you are not tied to a database.

18:39.440 --> 18:45.440
Yeah, well, sorry, I will repeat the question.

18:45.440 --> 18:51.440
So the question is: have you heard of DuckDB, and how does it compare to xan?

18:52.440 --> 18:54.440
And so, actually, I haven't tested it.

18:55.440 --> 19:05.440
But well, the philosophy is quite different, because I guess you need to index your data in a way, or like,

19:06.440 --> 19:12.440
create your database in a way, and then send your queries to that database,

19:12.440 --> 19:20.440
which is actually not the case with xan, because you just work on the base file, which is your CSV file.

19:21.440 --> 19:26.440
And there is no, like, first step where you send your data or index your data in that database,

19:27.440 --> 19:32.440
But since I don't know DuckDB, I don't know how it works, so maybe it's very fast.

19:35.440 --> 19:40.440
I think in the example that we saw on GitHub, it was using pipes, like bash pipes.

19:40.440 --> 19:41.440
Yes?

19:42.440 --> 19:47.440
Is there anything in xan to keep track of the context where you are?

19:48.440 --> 19:50.440
I guess it's almost a question of explainability, like:

19:51.440 --> 19:55.440
if I pipe 15 things together, how do I know what the fifth step is doing?

19:56.440 --> 19:58.440
Okay, so the question is,

20:00.440 --> 20:04.440
you are piping xan commands one into another,

20:04.440 --> 20:11.440
and so do you have any way of seeing in what context something happens?

20:12.440 --> 20:18.440
I guess in order to debug. Well, actually, xan errors, like,

20:19.440 --> 20:24.440
tell you at which xan command they failed, so that's the first point,

20:25.440 --> 20:29.440
but maybe you had something more specific in mind?

20:29.440 --> 20:33.440
Yeah, I guess I'm coming from the perspective of a huge dataset,

20:34.440 --> 20:38.440
100 gigs being processed, and it fails halfway through.

20:39.440 --> 20:43.440
Is there any way for a step to know about the previous steps or the next one, to be able to catch anything?

20:44.440 --> 20:45.440
Or is it just standard pipes?

20:46.440 --> 20:50.440
No, it doesn't know where you come from.

20:51.440 --> 20:57.440
It can just tell you the line number of what it has previously seen, but that's all.

20:58.440 --> 21:00.440
It's separate steps, let's say.

21:01.440 --> 21:02.440
Yes.

21:18.440 --> 21:22.440
So the question is, is it similar to SQL?

21:23.440 --> 21:28.440
And I won't be able to answer that question because I don't know what SQL is, but maybe you can explain.

21:31.440 --> 21:32.440
SQL.

21:35.440 --> 21:37.440
Okay, so the question is, is it similar to SQL?

21:40.440 --> 21:51.440
Well, in a way, yes, of course it's a way of preparing some data and getting some results.

21:52.440 --> 22:07.440
Well, as I was saying, using SQL also depends on using some kind of database.

22:08.440 --> 22:15.440
And so, putting your data in that database, and most often indexing it in a way.

22:16.440 --> 22:21.440
So that's the first thing that you should have in mind before processing your data.

22:22.440 --> 22:32.440
Well, here, really, the idea is: I can apply xan to any CSV file on my computer without having to ask myself, do I have to index it?

22:33.440 --> 22:35.440
Do I have to put it in my database?

22:40.440 --> 22:42.440
Oh, yes, it does.

22:43.440 --> 22:51.440
It's mainly, well, it depends, but it does.

22:52.440 --> 23:03.440
It does a kind of type inference at the visualization step, so that you can visualize numbers differently from dates, et cetera.

23:04.440 --> 23:21.440
And then, well, at the operation step, for example, if you have some commands that are specific to numbers,

23:22.440 --> 23:25.440
well, it will be able to deal with them specifically.

23:26.440 --> 23:32.440
So, yes, yes.

23:32.440 --> 23:33.440
Yes.

23:54.440 --> 23:59.440
Well, for now, people are just filing issues.

24:00.440 --> 24:02.440
Sorry.

24:03.440 --> 24:12.440
Yes, so: do you plan to have a system for people to add their own commands?

24:13.440 --> 24:21.440
And the answer is: for now, people are just filing issues on the GitHub repo, and we implement them or not.

24:22.440 --> 24:35.440
I don't think there is a plan to let people create what would be aliases, in a way: you create a pipeline and you name it in a specific way. That could be an option.

24:36.440 --> 24:39.440
We didn't discuss it so far, but it's interesting actually.

24:41.440 --> 24:48.440
And well, there is always the possibility to fork the project and add your own commands, but that's a bit heavy.

24:52.440 --> 24:54.440
Yes.

24:59.440 --> 25:02.440
Sorry.

25:05.440 --> 25:10.440
Well, the pandas script takes 1 minute to run, 1 minute 15, something like that.

25:11.440 --> 25:12.440
Sorry.

25:12.440 --> 25:26.440
And it takes, well, actually, I added a restriction on the columns that are read, so it only reads one column and doesn't take all the memory, but it loads the entire column into memory.

25:27.440 --> 25:33.440
And so of course it takes a lot of memory, and of course, if you wanted to read the entire file, you would take all the memory.

25:33.440 --> 25:45.440
And actually, the Python standard library takes the same time, like 1 minute 15, and doesn't take memory.

26:03.440 --> 26:26.440
Well, well: how do social scientists manage to keep the commands they produce, so they can be reused?

26:26.440 --> 26:41.440
How do social scientists manage to keep the commands they produce, so they can be reused, and all the preprocessing they did? Since everything is in the terminal, it's not ideal, because you don't have a trace of what you did.

26:41.440 --> 27:10.440
So, many of us are using bash files: you actually write down your pipeline in a .sh file and run it. And well, of course, we don't do it for very quick pipelines, but when you start to have a very large preprocessing pipeline, that's what you do.
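
NOTE
A minimal sketch of such a pipeline script, with hypothetical file and column names; keeping it in a .sh file under version control is what leaves the trace:
#!/bin/bash
# preprocess.sh: search the text column, count users, display the result
xan search -s text parliament tweets.csv |
  xan freq -s user |
  xan view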

27:11.440 --> 28:06.440
[Inaudible audience question and exchange.]

28:06.440 --> 28:21.440
Okay, thank you for the question.

28:21.440 --> 28:35.440
I will have a hard time

28:35.440 --> 28:42.440
repeating it entirely, but I will try to interpret it.

28:42.440 --> 29:00.440
So the question, or maybe a remark, was about the fact that xan is also kind of a low-tech technology:

29:00.440 --> 29:14.440
actually it uses very few resources, and also you are not using an LLM just to visualize your data or plot your data.

29:14.440 --> 29:19.440
Well, I have to say I completely agree.

29:19.440 --> 29:31.440
Well, that's a philosophy, or values, as we were saying, that we try to apply at the lab in the tools that we use and develop.

29:31.440 --> 29:41.440
Of course, we try to make tools that are robust, but also that use as few resources as possible.

29:41.440 --> 29:50.440
Also, because we have to say that social science researchers do not always have many resources.

29:50.440 --> 29:52.440
Many of them are poor.

29:52.440 --> 29:56.440
They only have their computer that doesn't have many resources.

29:56.440 --> 30:08.440
And so it means that they need to have tools that can be run in their own terminal and not on a huge server with GPU cards or anything like that.

30:08.440 --> 30:15.440
And so, well, the tools we develop also try to meet that requirement.

