WEBVTT

00:00.000 --> 00:12.240
We're back on time, so I'm very pleased to present Tilo Mathes, who will be

00:12.240 --> 00:16.560
speaking about something that those of us who sit at computers often kind of forget,

00:16.560 --> 00:24.680
which is that biology is actually an experimental science. Over to you.

00:24.760 --> 00:32.680
Right. And I'm also, I'm co-presenting here with José, so it's a very important part

00:32.680 --> 00:36.520
because we're talking essentially about two open source projects that came together to solve

00:36.520 --> 00:43.480
a problem in science and it also has to do with provenance and tracking essentially the

00:43.480 --> 00:50.600
scientific progress. So yeah, I really want to start with a small overview on how you get

00:50.600 --> 00:54.200
from primary data to computational analysis, what is involved in there.

00:54.680 --> 01:00.360
And then we'll both talk about our collaboration, like how we thought about exchanging data

01:00.360 --> 01:08.520
between two tools that do quite different things and essentially leading to connecting computational

01:08.520 --> 01:15.160
workflows with overall research documentation. So I want to start with an example that's a real

01:15.160 --> 01:23.640
person, Rolizsa. She is working with us on certain things and she's one example of a researcher who does

01:23.720 --> 01:30.600
both a lot of wet lab work. So there's a lot of sample preparation, DNA extraction. And then

01:30.600 --> 01:36.200
she hands it over to computational work. And a lot of times a lot of these researchers are doing

01:36.200 --> 01:42.600
like all of these things to some degree, right? So for her, the challenge is just how to

01:42.600 --> 01:50.280
effortlessly connect that primary data, these physical samples, to the experimental context and then

01:50.280 --> 01:56.280
the analysis workflows that have been run on it afterwards. So if you look at the tools and

01:56.280 --> 02:01.080
the workflows that she uses, you see there's a lot going on, right? There's cultivating cells,

02:01.080 --> 02:08.200
sample preparation, a lot of small data collection, like images, handwritten notes, digital notes,

02:08.200 --> 02:13.960
all kinds of stuff. The data lives across different kinds of file stores, local computers, etc.

02:13.960 --> 02:19.000
Right. And then essentially a lot of the data is analyzed. So there's code snippets lying around,

02:19.000 --> 02:26.600
Jupyter notebooks, HPC scripts, etc. Right. So how, essentially, do you get all this

02:26.600 --> 02:34.760
together? And here is basically where briefly we want to introduce the tools that we're looking at.

02:34.760 --> 02:41.080
So I work at Research Space. We build RSpace. It's an open source research platform for

02:41.080 --> 02:46.440
institutional research data management. So if you look at the core of this very busy image here,

02:46.440 --> 02:50.280
there's an electronic lab notebook and an inventory management system in there,

02:51.000 --> 02:56.200
which helps researchers in the active phase of like sample preparation and primary data

02:56.200 --> 03:02.040
preparation to document what they did and what came out of it. What differentiates us from

03:02.040 --> 03:07.320
other notebook tools and inventory management tools is that we're very interoperable

03:07.320 --> 03:11.960
with other research tools and research infrastructure, such as institutional file stores,

03:12.280 --> 03:18.840
data management planning tools, etc. And essentially RSpace can, through these integrations,

03:18.840 --> 03:25.560
essentially become a hub for recording the scientific progress. So with that I'll hand it

03:25.560 --> 03:36.440
over to you. Thank you. So about Galaxy, we are basically an open source data analysis platform

03:36.520 --> 03:45.160
on the web. It originally started from the field of bioinformatics, but over time it has been

03:45.880 --> 03:51.640
expanding to other scientific domains such as climate science, ecology,

03:51.640 --> 03:56.840
cheminformatics, imaging, data science, materials science, astronomy, and most recently,

03:56.840 --> 04:05.640
even humanities. I put a very simple example there of like an extremely simple analysis that you

04:05.640 --> 04:14.520
could do on imaging data, which is basically counting the number of points that you have on this

04:14.520 --> 04:27.160
microscope image, where you can see... Okay, great. Where you can see basically stained cells,

04:28.200 --> 04:35.000
but most typically you don't carry out this kind of analysis, but you use Galaxy in conjunction

04:35.000 --> 04:41.400
with an HPC computing network and you run computationally heavy and more complex analyses. For

04:41.400 --> 04:47.880
example RNA-seq or, I don't know, protein folding, or all these sorts of analyses that you do in

04:47.880 --> 04:55.080
bioinformatics. You can use the platform via a web browser, via an API, and most recently we also have

04:55.080 --> 05:04.520
an MCP server, so if you want to connect AI to it, that's also possible. So the design of

05:04.520 --> 05:13.240
Galaxy is aimed at researchers and it focuses on accessibility and transparency. In the design,

05:13.240 --> 05:20.520
let's say, the two most important concepts are histories, which are a sequence of

05:20.520 --> 05:26.520
no-code, reproducible transformations that you apply to data sets, and those are carried out

05:26.520 --> 05:31.400
by the so-called Galaxy tools, which are light wrappers around existing bioinformatics tools.

05:33.560 --> 05:39.160
Then the next logical step is the workflows, which are recipes that spawn a history from

05:40.360 --> 05:47.080
a set of inputs. Basically, you have some dynamic control flow as well on top of that and they

05:47.160 --> 05:54.120
can be created using the workflow editor, which is a GUI where you can basically rearrange the boxes,

05:54.120 --> 06:00.440
add boxes, connect them and in this way control the flow. You can also create workflows from a

06:00.440 --> 06:09.400
history etc. And then the design is very friendly to the FAIR principles because of some

06:09.400 --> 06:15.640
characteristics. I will mention only some of them, like the append-only design. You also have versioned

06:15.800 --> 06:22.280
tools and versioned workflows. It's possible to publish everything. You have seen some links around

06:22.280 --> 06:27.400
my presentation. This is because I've been publishing the analyses and the workflows. You can just

06:27.400 --> 06:36.680
click there and you will have access to it. You can also export workflows and histories and the

06:36.680 --> 06:41.560
platform can interoperate with many storage systems and compute, and you can even bring your own

06:42.200 --> 06:53.480
for your user. Of course, this history system has built-in provenance and that's about it.

06:53.480 --> 07:00.200
The last part about what Galaxy is: it is, of course, also a community. We would be nothing without

07:00.200 --> 07:08.200
our users, because the community maintains some critical parts of the infrastructure for Galaxy. One is

07:08.200 --> 07:13.560
the ToolShed, which is a public repository of tools contributed by the Galaxy community, which has

07:13.560 --> 07:21.000
over 10k tools. Then there is also the Galaxy Training Network, which is a large collection of

07:21.000 --> 07:27.160
tutorials that are also contributed by the community. You just visit this URL and you have access to

07:27.160 --> 07:33.960
a wide range of tutorials in different scientific domains and also in Galaxy development. So

07:34.840 --> 07:42.920
administering a server or developing tools. Then the communities themselves aim at specific research

07:42.920 --> 07:49.800
areas. Many of them have their own subdomain, so they can have their own list of tools and so on.

07:51.000 --> 07:57.000
Finally, but most importantly, there are the public Galaxy servers that can be accessed by anyone and are

07:57.080 --> 08:06.440
free to use. I personally maintain the usegalaxy.eu server, but there's also a large US server

08:06.440 --> 08:11.800
and an Australian server and also a French one, but more are in the making. I'm just listing,

08:11.800 --> 08:18.280
let's say, the ones that have existed for a longer time, but we are incorporating, for example,

08:18.280 --> 08:27.160
a Belgian server and so on. So with this, I hand back over to Tilo. Although you will hand back

08:27.160 --> 08:33.960
to me. Thanks. Yeah. So essentially, like you've seen, we have now two tools and we want to use the

08:33.960 --> 08:40.280
best of both worlds. So in Galaxy, you have user-friendly access to computational workflows.

08:40.280 --> 08:45.720
RSpace is basically a documentation hub, right? That also then can put your data

08:46.520 --> 08:52.200
once you're done with it elsewhere later. So we thought about, like, how can we use these two things

08:52.200 --> 09:02.600
to streamline workflows and help researchers like Rolizsa? And so essentially,

09:03.880 --> 09:11.880
from the Galaxy side, we incorporated RSpace as a file source. This means that you can mount

09:11.960 --> 09:19.400
RSpace as a repository in Galaxy. And you can therefore import or export data sets from

09:19.400 --> 09:26.680
RSpace or to RSpace. You can also export the whole histories themselves. You can also export workflows

09:26.680 --> 09:33.480
as RO-Crates. And the setup process is quite simple. Basically, you go to your user preferences

09:33.480 --> 09:39.080
and you will have a screen to connect your own instance or whatever RSpace instance you want.

09:39.080 --> 09:47.800
And then, basically, after that, you have it available as a repository to import and export files

09:47.800 --> 09:56.840
and browsing. So. Right. And then we thought we wanted to make it, like, also more user-friendly

09:56.840 --> 10:04.280
from the RSpace side. So we created a small affordance where researchers can, from a document that

10:04.280 --> 10:09.480
they have in RSpace that has data connected to it, basically create a new history in Galaxy

10:09.480 --> 10:16.360
and send that data over automatically. Automatically, also there's some metadata being pushed

10:16.360 --> 10:24.200
over, like unique identifiers of the file and the data in RSpace. And then as soon as, on the

10:24.200 --> 10:29.720
Galaxy side, the workflow is invoked, the user in RSpace can actually keep track of what's going on

10:29.720 --> 10:35.640
and when it's done. And then when it's done, they can put the data back into their RSpace document.

10:36.360 --> 10:44.120
Right. And yeah. And so, as mentioned, these exports can be in the form of

10:44.120 --> 10:52.120
RO-Crates, right? Or BioCompute Objects. And last but not least, if we have a little bit of time,

10:52.200 --> 11:00.200
we have a video of what this looks like. Not sure if the sound will work.

11:06.200 --> 11:08.920
Oh, yes. That looks good. Yeah.

11:08.920 --> 11:12.280
The video doesn't stream. Let me test this out. The RSpace Galaxy integration

11:12.280 --> 11:16.600
enables seamless data analysis and documentation workflows while automatically

11:16.600 --> 11:23.400
keeping provenance from primary data to analysis results. Starting in RSpace, researchers

11:23.400 --> 11:29.240
can select data attached to their documents like this aerial forest photograph and upload it directly

11:29.240 --> 11:37.800
to Galaxy with one click. RSpace automatically creates a new Galaxy history with systematic

11:37.800 --> 11:48.920
naming based on the RSpace document that the data was attached to. Additionally, RSpace adds

11:48.920 --> 11:54.120
metadata linking back to the original experimental documentation and data in the annotation of

11:54.120 --> 12:04.520
the transferred files. In Galaxy, researchers can now run any available workflow on their data.

12:04.760 --> 12:18.920
Here, we're using Voronoi segmentation to identify and count individual tree crowns in the forest image.

12:27.320 --> 12:32.600
Back in RSpace, researchers can track the progress of their analysis in Galaxy by inspecting

12:32.680 --> 12:42.520
the workflow status. When complete, the researcher can use the direct links to navigate to the specific

12:42.520 --> 12:57.160
invocation and its results in Galaxy. From the invocation view, researchers that have set up RSpace

12:57.160 --> 13:02.760
as a file source in Galaxy can export complete workflow packages directly back to a

13:02.760 --> 13:05.960
location of their choice in the RSpace Gallery.

13:20.120 --> 13:25.640
Alternatively, they can select specific result files to be transferred back to the RSpace Gallery

13:25.640 --> 13:29.640
using the send data tool.

13:44.760 --> 13:50.360
Finally, results can be efficiently integrated directly into the original RSpace document,

13:50.360 --> 13:56.760
maintaining full provenance from experimental data through computational analysis and making it available

13:56.760 --> 14:00.280
to RSpace's ecosystem of integrated tools and services.

14:07.960 --> 14:13.640
This integration is available now with RSpace version 1.13 and Galaxy 254.

14:13.640 --> 14:24.760
That was that. We also want to conclude. We have a couple of links to the Galaxy community

14:24.760 --> 14:31.000
and also to the RSpace community resources. We actually also have an open office hour coming

14:31.000 --> 14:37.560
up next week. You're all invited to join if you want. With that, any final words from you?

14:38.520 --> 14:42.520
Oh yeah, thanks for your attention.

14:47.160 --> 14:52.440
So unfortunately, we do not have time for questions, but we do have a Matrix space which you can go

14:52.440 --> 14:57.080
to ask questions and you should post about the office hours and also the in-of-go-hack

14:57.080 --> 15:03.000
from coming up as well. People who want to leave please do leave. Let other people in. Thank you very much.

