Where to store the labels? #5

Open
tischi opened this issue Nov 11, 2023 · 5 comments
Labels
question Further information is requested

tischi commented Nov 11, 2023

Right now we are having this situation:

Before the analysis:

project_root/
│
├── image.zarr/

After the analysis:

project_root/
│
├── image.zarr/
│   ├── labels/
├── blurred_image.zarr/

@BioinfoTongLI is this correct? Maybe not entirely, because the blurred_image may go to a different output_dir?!

Since both the labels and the blurred image are a result of the analysis, from a data management point of view, it does feel weird to me that the labels are an in-place modification of the input image while the blurred image is outside.

I admit however that it is very convenient to have the labels with the image for later visual inspection. One may argue that the blurred image is just some non-interesting intermediate. In fact, probably one would not even save it in real life (and we just did that here for practicing)?!

@joshmoore What is your take on this? I think you said only storing labels in a separate container is possible?

@krokicki What would be your preferred best practice here?

@tibuch Your opinion?

tischi added the question label on Nov 11, 2023
krokicki commented Nov 11, 2023

The best practice is of course to never modify input data. This is easy with pipelines that take data in some format and output processed data in some other format (e.g. CZI->Zarr). In that case I would follow the nf-core standard and have a parameter called outdir which contains all the pipeline output.

However, when your workflow takes a Zarr as input, it seems to me like it should be okay to write back into it (as long as you don't modify existing data sets). This seems like a common usage pattern with NGFF formats: you do some processing and augment the input data set with another data set. And all the tools expect this (Napari, BDV, etc.), so if you write the labels to another container it may not be possible to easily visualize them.

Personally, I would write the blurred image to the same Zarr as another group, since it is just a processed version. This keeps things a little more organized on the file system, and the provenance is more clear:

├── image.zarr/
│   ├── blurred/
│   ├── labels/

There could be a pipeline option like --delete-intermediate=true to clean up the blurred data if it's not generally useful after the labels are computed.

joshmoore commented

Barring the fact that there's no provision for "discovering" the blurred directory via the Zarr metadata, @krokicki's take matches mine. In the future, I think it would be good if we can support "writing to new subgroups and mildly updating the metadata". A summary for the current state might look like this:

| | Same Zarr | Different Zarr |
|---|---|---|
| Labels | Under labels/ dir. Pro: part of spec. Con: requires updating metadata. | Pro: outside of spec. Con: requires something to tie the two filesets together. |
| Other output | Avoid if possible, or use bf2raw metadata. Pro: keeps everything together. Con: no metadata to find the new subgroup. | As above. |
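
To make the "requires updating metadata" point for the same-Zarr labels case concrete, here is a minimal sketch. It assumes a Zarr v2 directory store (group attributes in a `.zattrs` file, groups marked by `.zgroup`) and the NGFF 0.4 convention that `labels/.zattrs` lists the label image names under a `labels` key. The function name `register_label` is hypothetical, and the sketch only touches the JSON metadata, not the label arrays themselves.

```python
from __future__ import annotations

import json
from pathlib import Path


def register_label(image_zarr: Path, label_name: str) -> list[str]:
    """Add label_name to the `labels` list in <image_zarr>/labels/.zattrs."""
    labels_dir = image_zarr / "labels"
    labels_dir.mkdir(parents=True, exist_ok=True)

    # A Zarr v2 group is marked by a .zgroup file; create it if missing.
    zgroup = labels_dir / ".zgroup"
    if not zgroup.exists():
        zgroup.write_text(json.dumps({"zarr_format": 2}))

    # Read the existing attributes so that unrelated keys are preserved.
    zattrs = labels_dir / ".zattrs"
    attrs = json.loads(zattrs.read_text()) if zattrs.exists() else {}

    # Append the new name only once, so re-running the pipeline is harmless.
    names = attrs.get("labels", [])
    if label_name not in names:
        names.append(label_name)
    attrs["labels"] = names
    zattrs.write_text(json.dumps(attrs, indent=2))
    return names
```

Nothing comparable exists for an arbitrary subgroup such as blurred/, which is the "no metadata to find the new subgroup" con in the table.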

krokicki commented

Okay, so it sounds like the blurred version does belong outside as we have currently, and if it's being deleted anyway then keeping these tied together doesn't matter too much.

But I also share the feeling that @tischi has about this being a little strange from a data management perspective. Maybe one way to mitigate it would be to make it explicit. Write the labels to a separate zarr by default, and have an option that allows writing it to the original zarr: --write-labels-to-input or something like that?

So by default:

input.zarr/ (unmodified)
outdir/
│
├── labels.zarr/
├── blurred_image.zarr/

But with the option:

input.zarr/
├── labels/ (added by pipeline)
outdir/
│
├── blurred_image.zarr/

Just brainstorming.
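
The two layouts above could be selected roughly like this. This is only a sketch of the idea: the option name `--write-labels-to-input` is the one floated above, while `--input`, `--outdir`, `resolve_outputs` and `build_parser` are hypothetical names, not part of any existing pipeline.

```python
import argparse
from pathlib import Path


def resolve_outputs(input_zarr: Path, outdir: Path, write_labels_to_input: bool) -> dict:
    """Return where each pipeline output would be written."""
    if write_labels_to_input:
        labels = input_zarr / "labels"   # augments the input container
    else:
        labels = outdir / "labels.zarr"  # input stays unmodified
    return {"labels": labels, "blurred": outdir / "blurred_image.zarr"}


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for the hypothetical pipeline."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline CLI")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--outdir", type=Path, required=True)
    # Off by default, so modifying the input is always an explicit choice.
    parser.add_argument("--write-labels-to-input", action="store_true")
    return parser
```

Parsing `["--input", "input.zarr", "--outdir", "outdir"]` and passing the result to `resolve_outputs` gives the default layout; adding `--write-labels-to-input` moves only the labels into input.zarr.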

BioinfoTongLI commented

@tischi, yes. That's the current setting. And yes, the output_dir can be anywhere now.
Actually, as a user I don't really have a preference about the structure. If there is one, I believe having this option is essential:

input.zarr/
├── labels/ (added by pipeline)

But from a FAIR perspective, this makes more sense:

├── input.zarr/
│   ├── blurred/
│   ├── labels/

This is because the labels are generated from the blurred image (or from any other preprocessing of the raw data) that is produced before segmentation.

tibuch commented Nov 13, 2023

Very interesting discussion. I can see an argument for both approaches. As a user I would find it more convenient to have the individual outputs as individual zarr files:

outdir/
│
├── labels.zarr/
├── blurred_image.zarr/

This would make it easy to test different parameters (e.g. blur-sigmas) since I can just delete the whole directory and I am done with the clean-up.

From a FAIR perspective it would make sense to put every processing output as a sub-group. However, that would require some extra work on the spec. A discussion has started in this issue.

It seems to me that we would almost need two flags. One on the workflow level indicating if the results should be separate zarr-files and one on the task level indicating if the result should be kept after workflow completion. The second flag could be used by a clean-up task to remove intermediate results, which are not required any longer.
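
A rough sketch of how the two flags could interact, just to illustrate the idea. All names here (`TaskOutput`, `keep`, `separate_zarrs`, `plan_outputs`, `cleanup_targets`) are hypothetical, and the `<task name>.zarr` naming is an assumption for the example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskOutput:
    name: str          # e.g. "blurred" or "labels"
    keep: bool = True  # task-level flag: keep after workflow completion?


def plan_outputs(input_zarr, outdir, tasks, separate_zarrs):
    """Map each task name to its output path, per the workflow-level flag."""
    if separate_zarrs:
        # One Zarr container per task result, all under outdir.
        return {t.name: Path(outdir) / f"{t.name}.zarr" for t in tasks}
    # Otherwise every result becomes a sub-group of the input Zarr.
    return {t.name: Path(input_zarr) / t.name for t in tasks}


def cleanup_targets(plan, tasks):
    """Paths a clean-up task would remove once the workflow has finished."""
    return [plan[t.name] for t in tasks if not t.keep]
```

With tasks `blurred` (keep=False) and `labels` (keep=True), the clean-up task would only be handed the blurred output, wherever the workflow-level flag placed it.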
