Where to store the labels? #5

tischi · 2023-11-11T08:42:46Z

Right now we are having this situation:

Before the analysis:

project_root/
│
├── image.zarr/

After the analysis:

project_root/
│
├── image.zarr/
│   ├── labels/
├── blurred_image.zarr/

@BioinfoTongLI is this correct? Maybe not entirely, because the blurred_image may go to a different output_dir?!

Since both the labels and the blurred image are a result of the analysis, from a data management point of view, it does feel weird to me that the labels are an in-place modification of the input image while the blurred image is outside.

I admit, however, that it is very convenient to have the labels with the image for later visual inspection. One may argue that the blurred image is just an uninteresting intermediate. In fact, one would probably not even save it in real life (we only did that here for practice)?!

@joshmoore What is your take on this? I think you said that storing only the labels in a separate container is possible?

@krokicki What would be your preferred best practice here?

@tibuch Your opinion?

krokicki · 2023-11-11T13:45:28Z

The best practice is of course to never modify input data. This is easy with pipelines that take data in some format and output processed data in some other format (e.g. CZI->Zarr). In that case I would follow the nf-core standard and have a parameter called outdir which contains all the pipeline output.

However, when your workflow takes a Zarr as input, it seems to me like it should be okay to write back into it (as long as you don't modify existing data sets). This seems like a common usage pattern with NGFF formats: you do some processing and augment the input data set with another data set. And all the tools expect this (Napari, BDV, etc.), so if you write the labels to another container it may not be possible to easily visualize them.

Personally, I would write the blurred image to the same Zarr as another group, since it is just a processed version. This keeps things a little more organized on the file system, and the provenance is more clear:

├── image.zarr/
│   ├── blurred/
│   ├── labels/

There could be a pipeline option like --delete-intermediate=true to clean up the blurred data if it's not generally useful after the labels are computed.
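A minimal sketch of what such a clean-up step could look like, assuming the Zarr is a plain directory store on a local file system. The function name `clean_up_intermediate` and its parameters are hypothetical and only illustrate the idea behind the suggested `--delete-intermediate` option:

```python
import shutil
from pathlib import Path


def clean_up_intermediate(zarr_root: Path, group: str = "blurred",
                          delete_intermediate: bool = False) -> bool:
    """Remove an intermediate group directory from a Zarr directory store.

    Hypothetical helper; returns True if something was deleted.
    """
    target = zarr_root / group
    # Do nothing unless the user opted in and the group actually exists.
    if not delete_intermediate or not target.is_dir():
        return False
    # Guard against group names such as "../other" escaping the container.
    if zarr_root.resolve() not in target.resolve().parents:
        raise ValueError(f"{group!r} is not inside {zarr_root}")
    shutil.rmtree(target)
    return True
```

Because only the named subgroup is removed, the original image data and the `labels/` group stay untouched.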

joshmoore · 2023-11-11T16:58:45Z

Barring the fact that there's no provision for "discovering" the blurred directory via the Zarr metadata, @krokicki's take matches mine. In the future, I think it would be good if we can support "writing to new subgroups and mildly updating the metadata". A summary for the current state might look like this:

| | Same Zarr | Different Zarr |
|---|---|---|
| Labels | Under `labels/` dir. Pro: part of spec. Con: requires updating metadata. | Pro: outside of spec. Con: requires something to tie the two filesets together. |
| Other output | Avoid if possible or use bf2raw metadata. Pro: keeps everything together. Con: no metadata to find the new subgroup. | As above. |
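To make "requires updating metadata" concrete, here is a rough sketch using only the standard library. It assumes a Zarr v2 directory store and the NGFF 0.4 convention that `labels/.zattrs` holds a `"labels"` list naming the label images. The helper name `register_label` is hypothetical, and the label image itself still needs its own multiscales and image-label metadata, which is not shown:

```python
from __future__ import annotations

import json
from pathlib import Path


def register_label(image_zarr: Path, label_name: str) -> list[str]:
    """Add label_name to the "labels" list in image.zarr/labels/.zattrs."""
    labels_dir = image_zarr / "labels"
    labels_dir.mkdir(parents=True, exist_ok=True)

    # A Zarr v2 group is marked by a .zgroup file; create it if missing.
    zgroup_path = labels_dir / ".zgroup"
    if not zgroup_path.exists():
        zgroup_path.write_text(json.dumps({"zarr_format": 2}))

    # Read existing attributes so that other keys are preserved.
    attrs_path = labels_dir / ".zattrs"
    attrs = json.loads(attrs_path.read_text()) if attrs_path.exists() else {}
    names = attrs.setdefault("labels", [])
    if label_name not in names:
        names.append(label_name)
    attrs_path.write_text(json.dumps(attrs, indent=2))
    return names
```

This is the "mild" metadata update for labels; for other outputs such as a `blurred/` group there is currently no equivalent list that a reader could use to discover it.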

krokicki · 2023-11-11T18:50:09Z

Okay, so it sounds like the blurred version does belong outside as we have currently, and if it's being deleted anyway then keeping these tied together doesn't matter too much.

But I also share the feeling that @tischi has about this being a little strange from a data management perspective. Maybe one way to mitigate it would be to make it explicit: write the labels to a separate zarr by default, and have an option that allows writing them to the original zarr, --write-labels-to-input or something like that?

So by default:

input.zarr/ (unmodified)

outdir/
│
├── labels.zarr/
├── blurred_image.zarr/

But with the option:

input.zarr/
├── labels/ (added by pipeline)

outdir/
│
├── blurred_image.zarr/

Just brainstorming.
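As a sketch of how the two layouts above could be selected on the command line: the flag `--write-labels-to-input` is the one floated above, while `--input`, `--outdir`, `build_parser` and `resolve_outputs` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical segmentation pipeline")
    parser.add_argument("--input", type=Path, required=True, help="input Zarr")
    parser.add_argument("--outdir", type=Path, required=True, help="output directory")
    parser.add_argument("--write-labels-to-input", action="store_true",
                        help="write labels into the input Zarr instead of outdir")
    return parser


def resolve_outputs(args: argparse.Namespace) -> dict[str, Path]:
    """Decide where each pipeline output goes, without writing anything."""
    if args.write_labels_to_input:
        labels = args.input / "labels"        # explicit opt-in: modify the input
    else:
        labels = args.outdir / "labels.zarr"  # default: input stays unmodified
    return {"labels": labels, "blurred": args.outdir / "blurred_image.zarr"}
```

With `--input input.zarr --outdir outdir`, the labels resolve to `outdir/labels.zarr`; adding `--write-labels-to-input` moves them to `input.zarr/labels`, while the blurred image stays in `outdir` in both cases.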

BioinfoTongLI · 2023-11-12T18:52:50Z

@tischi, yes. That's the current setting. And yes, the output_dir can be anywhere now.
Actually, as a user I don't really have preferences about the structure. If there's one, I believe having this option is essential:

input.zarr/
├── labels/ (added by pipeline)

But from a FAIR perspective, this makes more sense:

├── input.zarr/
│   ├── blurred/
│   ├── labels/

That is because the labels are generated from the blurred image (or from whatever preprocessing is applied to the raw data before segmentation).

tibuch · 2023-11-13T08:23:40Z

Very interesting discussion. I can see an argument for both approaches. As a user I would find it more convenient to have the individual outputs as individual zarr files:

outdir/
│
├── labels.zarr/
├── blurred_image.zarr/

This would make it easy to test different parameters (e.g. blur-sigmas) since I can just delete the whole directory and I am done with the clean-up.

From a FAIR perspective it would make sense to put every processing output as a sub-group. However, that would require some extra work on the spec. A discussion has started in this issue.

It seems to me that we would almost need two flags. One on the workflow level indicating if the results should be separate zarr-files and one on the task level indicating if the result should be kept after workflow completion. The second flag could be used by a clean-up task to remove intermediate results, which are not required any longer.
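One way the two flags could fit together, sketched with hypothetical names (`TaskOutput`, `plan_outputs`, `cleanup_targets`, `separate_zarrs`, `keep`). This only plans paths and does not write or delete anything:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskOutput:
    name: str          # e.g. "blurred_image" or "labels"
    keep: bool = True  # task-level flag: keep after workflow completion?


def plan_outputs(input_zarr: Path, outdir: Path, outputs: list[TaskOutput],
                 separate_zarrs: bool = True) -> dict[str, Path]:
    """Workflow-level flag: separate Zarrs in outdir, or sub-groups of the input."""
    plan = {}
    for out in outputs:
        if separate_zarrs:
            plan[out.name] = outdir / f"{out.name}.zarr"
        else:
            plan[out.name] = input_zarr / out.name
    return plan


def cleanup_targets(plan: dict[str, Path], outputs: list[TaskOutput]) -> list[Path]:
    """Paths that a clean-up task would remove once the workflow has finished."""
    return [plan[out.name] for out in outputs if not out.keep]
```

For example, marking the blurred image with `keep=False` and the labels with the default `keep=True` yields a clean-up list containing only the blurred output, wherever the workflow-level flag placed it.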

tischi added the `question` label on Nov 11, 2023
