
Feature/ome zarr writer v2 #48

Merged: 28 commits merged into main from feature/ome-zarr-writer_v2 on Jul 17, 2024
Conversation

@toloudis (Contributor) commented Jul 3, 2024

Link to Relevant Issue

Resolves #46

We have been developing a new ome-zarr-writer that works better for our long time-series data. We want to bring that code into this repo and make it part of bioio.

Description of Changes

Add new ome zarr writer. Do some code cleanup, and make sure it is well commented and documented.
Add unit tests for it.

This PR is intended as an "initial add": it brings the new code in without functional changes, apart from code formatting and commenting, so that it can be a direct replacement in our ome-zarr-conversion codebase.

.pre-commit-config.yaml
Comment on lines 5 to 12
from .ome_zarr_writer import OmeZarrWriter as OmeZarrWriter_old
from .ome_zarr_writer_2 import OmeZarrWriter

__all__ = [
"OmeTiffWriter",
"OmeZarrWriter",
"OmeZarrWriter_old",
]
Contributor:

Is our old writer better for large Z-stack data while this one is better for time series, or is this one just better overall?

Contributor Author:

Good question. (Note this is still a draft PR, and I am not expecting that this _old suffix is really best. It's less of a breakage to call the new one OmeZarrWriter2 and leave the old one with the same name.)

Here are some of the key differences:

  • The new one is built on tooling for computing multiscales that is more stable than what the old one uses
  • The new one works better for bigger data because it does things in smaller steps. There is no one-size-fits-all "write this data array to this url" function in the new writer -- at least not yet.
  • The new code is coming from another repo where it's been used to do some bulk data conversions internally already.
  • The new one is currently providing functions that expect to deal with time series of zstacks where the zstacks still might reasonably fit in memory.
  • The new one is currently assuming 5D data coming from bioio.BioImage, or sequences of images (one per T), or (in progress) a 5D ArrayLike.

@toloudis toloudis marked this pull request as ready for review July 8, 2024 21:11
@toloudis toloudis requested a review from a team as a code owner July 8, 2024 21:11
"numpy>=1.21.0,<2.0.0",
"ome-types[lxml]>=0.4.0",
"ome-zarr>=0.6.1",
"semver>=3.0.1",
"tifffile>=2021.8.30",
"zarr>=2.6.0,<3.0.0",
# this pin <2.18.0 should be temporary and is maybe related to https://github.com/zarr-developers/zarr-python/issues/1891
"zarr>=2.6.0,<2.18.0",
Contributor Author:

Sadly, without this version pin, unit tests were failing. I don't know what the fix is, but we should probably file a "reminder" issue to try to unpin it again later.

channel_colors=my_channel_colors,
)
writer.write_metadata(meta)
"""
Contributor Author:

The above example shows the typical usage/code flow for this writer.

Contributor:

Does this also resolve the desire to be able to append new data to the zarr store over time? I.e., while a microscope is taking a timelapse, can we store each timepoint as it comes in?

If it's not possible, no worries; if it is possible, I would say that is worth highlighting.

Contributor Author (@toloudis, Jul 15, 2024):

This code isn't made for modifying an existing store; I believe it's only been tested for when you are ready to write everything all at once, i.e. the use case of converting pre-existing files in other formats. But it might work in the scenario you describe! Part of the workflow does involve knowing the total shape ahead of time.
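To make the "total shape known ahead of time" point concrete, here is a minimal sketch of that workflow, using a plain NumPy array as a stand-in for a preallocated zarr store (illustrative only; this does not use this writer's API):

```python
import numpy as np

# The full TCZYX shape must be known before acquisition starts.
shape = (10, 2, 4, 32, 32)
store = np.zeros(shape, dtype=np.uint16)  # stand-in for a preallocated store

# Each timepoint is then written into its slot as it arrives.
for t in range(shape[0]):
    timepoint = np.full(shape[1:], t, dtype=np.uint16)  # fake acquired data
    store[t] = timepoint
```

With a real zarr array the per-timepoint writes would land on disk incrementally, but the preallocation step is what requires knowing the final T dimension up front.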

# Check tests pass on multiple Python and OS combinations
test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
python-version: [3.9, "3.10", "3.11"]
os: [ubuntu-latest, macOS-latest, windows-latest]
python-version: [3.9, "3.10", "3.11", "3.12"]
Contributor:

Nit: I believe the quotations are only necessary for versions with a trailing zero?

Contributor Author:

I can remove them; I was following the pre-existing pattern, which already had 3.11 in quotes.
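For background on the nit above: YAML resolves an unquoted 3.10 as a number, and numbers drop the trailing zero, so only trailing-zero versions strictly need quotes. A quick illustration of the underlying float behavior in Python:

```python
# What YAML hands to CI without quotes is a number, not text.
unquoted = 3.10             # a float: the ".10" is not preserved
quoted = "3.10"             # a string: literal text preserved

assert unquoted == 3.1      # 3.10 has silently become 3.1
assert quoted == "3.10"     # quoting keeps the exact version label
```

3.9 and 3.11 survive either way, which is why quoting them is purely stylistic.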

# Rechunk the input blocks so that the factors achieve an output
# block size of whole numbers.
better_chunksize = tuple(
np.maximum(1, np.round(np.array(image.chunksize) * factors) / factors).astype(
Contributor:

Nit: different methods of finding the maximum are used throughout the file.

Contributor Author (@toloudis, Jul 15, 2024):

I'm not going to modify this one in this PR, as it is very specific to the downsampling code and is borrowed from a trusted implementation.
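For reference, the rounding in the snippet above snaps each chunk dimension to a size the scale factor divides evenly. A standalone illustration (assuming factors are per-axis output/input ratios, which the excerpt does not spell out):

```python
import numpy as np

chunksize = np.array([1, 1, 5, 600, 600])          # input dask chunk sizes
factors = np.array([1.0, 1.0, 0.5, 0.5, 0.5])      # assumed output/input ratio

# Round each dim so that dim * factor is a whole number, with a floor of 1.
better_chunksize = tuple(
    np.maximum(1, np.round(chunksize * factors) / factors).astype(np.int64)
)
# The Z chunk of 5 becomes 4, since 4 * 0.5 downsamples cleanly to 2,
# while 5 * 0.5 would leave a fractional 2.5-voxel output block.
```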

omero = {
"id": 1, # ID in OMERO
"name": image_name, # Name as shown in the UI
"version": "0.4", # Current version
Contributor:

What is this versioning from?

Contributor:

We may want to make this a constant at the top of the file for easier checking of what omero version our writer uses via import rather than loading of file metadata

Contributor Author:

This is intended to be the ome-ngff spec version number. I agree I can hoist it up to a symbolic constant, but it is also potentially intimately connected to the names in this exact dict.

# let's start by just mandating that chunks have to be no more than
# 1 T and 1 C
chunk_size = (1, 1, shape[2], shape[3], shape[4])
while prod(chunk_size) * itemsize > memory_target:
Contributor:

I think this could be simplified to

while prod(chunk_size) * itemsize > memory_target:
    chunk_size = tuple(max(cs // 2, 1) for cs in chunk_size)

Contributor Author:

Nope; mypy hates that due to the enforcement of 5D tuples here.
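For context, a variant of the one-liner that keeps the fixed-length tuple type is to rebuild the 5-tuple explicitly. A minimal sketch (the function name and the memory_target/itemsize parameters are hypothetical, mirroring the excerpt; the all-ones guard is an addition so the loop cannot spin forever):

```python
from math import prod
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]

def shrink_chunks(chunk_size: Shape5D, itemsize: int,
                  memory_target: int) -> Shape5D:
    # Halve the Z/Y/X dims until the chunk fits under the memory target.
    # Building the tuple element by element preserves the fixed-length
    # Tuple[int, int, int, int, int] annotation that mypy enforces here,
    # where a generator expression would widen it to Tuple[int, ...].
    while prod(chunk_size) * itemsize > memory_target:
        if chunk_size[2:] == (1, 1, 1):
            break  # nothing left to halve
        chunk_size = (
            chunk_size[0],
            chunk_size[1],
            max(chunk_size[2] // 2, 1),
            max(chunk_size[3] // 2, 1),
            max(chunk_size[4] // 2, 1),
        )
    return chunk_size
```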



def _pop_metadata_optionals(metadata_dict: dict) -> dict:
for ax in metadata_dict["axes"]:
Contributor:

Maybe:
metadata_dict["axes"] = [ax for ax in metadata_dict["axes"] if ax.get("unit") is not None]

Contributor Author:

If I did that, it would change the behavior of this code: your suggested change would keep only the axes that have a unit, while the point of this code is to go through each "axis" and remove the "unit" key if it happened to have a None value. This function is called "pop metadata optionals"; it's just trying to delete keys that aren't holding meaningful data and that the spec says are optional.

List of shapes of all nlevels.
"""
shapes = [lvl0shape]
for i in range(nlevels - 1):
Contributor:

Maybe

for _ in range(nlevels - 1):
  shapes.append(tuple(max(int(s / sc), 1) for s, sc in zip(shapes[-1], scaling)))
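A runnable version of the suggestion above (the function name compute_level_shapes is hypothetical; parameter names follow the docstring fragment):

```python
from typing import List, Sequence, Tuple

def compute_level_shapes(
    lvl0shape: Sequence[int], scaling: Sequence[float], nlevels: int
) -> List[Tuple[int, ...]]:
    # Start from the level-0 shape and divide by the per-axis scaling
    # at each level, never letting a dimension shrink below 1.
    shapes = [tuple(lvl0shape)]
    for _ in range(nlevels - 1):
        shapes.append(
            tuple(max(int(s / sc), 1) for s, sc in zip(shapes[-1], scaling))
        )
    return shapes
```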

Contributor (@BrianWhitneyAI) left a comment:

This is awesome! Nice work!

Contributor (@evamaxfield) left a comment:

Generally all looks good to me! Nice work. My biggest nitpick is to store this in a different location rather than calling it "ome_zarr_writer_2" maybe bioio.experimental.writers.ome_zarr_writer?

def resize(
image: da.Array, output_shape: Tuple[int, ...], *args: Any, **kwargs: Any
) -> da.Array:
r"""
Contributor:

Do we need the r""" at the front here? That prefix marks a raw string, typically used for regex patterns or docstrings containing backslashes.

Contributor Author:

I can remove it


Contributor Author (@toloudis):
> Generally all looks good to me! Nice work. My biggest nitpick is to store this in a different location rather than calling it "ome_zarr_writer_2" maybe bioio.experimental.writers.ome_zarr_writer?

Actually I would suggest that this writer code is LESS experimental than the other, since it's been used (from a different repo) to do quite a lot of bulk conversions already.

@toloudis toloudis merged commit d1d98c5 into main Jul 17, 2024
20 checks passed
@toloudis toloudis deleted the feature/ome-zarr-writer_v2 branch July 17, 2024 22:46
Successfully merging this pull request may close these issues.

Add new ome-zarr-writer
3 participants