
attempts to create a writer #69

Merged
merged 26 commits into from
Mar 24, 2021

Conversation

glyg
Contributor

@glyg glyg commented Dec 4, 2020

This code base does not offer a clear way to create zarr images from scratch (apart from the function in the data module). Although bioformats2raw works fine for already existing data, at some point data will be recorded directly in zarr, especially in custom microscope setups.

So, following a discussion with Josh on image.sc, here is an attempt at a write_image function.

This is very early; in particular, I think the ZarrLocation classes should be adapted for writing and should better manage paths and groups.

Thanks for any input!

@glyg
Contributor Author

glyg commented Dec 4, 2020

looks like I broke some references, let me do this again ^^'

Member

@joshmoore joshmoore left a comment

Thanks, @glyg. Generally looks good. Let me know if you want to try the integration into ome_zarr.io or not. I like the metadata argument 👍 The byte_order argument may go through a refactoring soon as we loosen the dimension order early next year, but this makes good sense for the moment.

I should have the GitHub Actions happy again after lots of work in #51. I'll try to push something today to make it go green.

Cheers.

"""
zarr_location = parse_url(path)

## I assume this should be dealt with in
Member

Agreed. I think what's needed first is a way to pass mode through to the parse_url and *Location classes.


root = zarr.group(store)
if group is not None:
    grp = root.create_group(group)
Member

require_group should do what you want.

@joshmoore
Member

Pushed commits to:

  • merge master to disable test_napari.py for zarr-dev
  • fix the output of pre-commit run -a

@joshmoore
Member

Thanks for the update, @glyg. Do you want to run pre-commit run -a and push, or shall I? (I can recommend running pre-commit install in any repo that requires it)

@glyg
Contributor Author

glyg commented Dec 7, 2020

Hi @joshmoore, I did not know about pre-commit; I'll apply it. I'm trying to integrate the ZarrLocation mechanisms into the writer, and will push (after running pre-commit :) once I have something functional.

@joshmoore
Member

Sounds great! Thanks again, @glyg.

@glyg
Contributor Author

glyg commented Dec 7, 2020

Ok, I have some issues

First pre-commit fails but I don't know how to make mypy happy

Second, it seems that when only OMERO specs are found by the reader, node.data is not set (although the array is there).
I'm not sure whether this is intended or how to manage it.

Finally I assume the writer should be coded as an object, but I'm not sure how.

@joshmoore
Member

First pre-commit fails but I don't know how to make mypy happy

ome_zarr/writer.py:52: error: Argument "zarr" to "Node" has incompatible type "Optional[BaseZarrLocation]"; expected "BaseZarrLocation"

This means that the variable somewhere could conceivably be None. You can either perform an if zarr is None check, or make sure it's never None.

ome_zarr/writer.py:65: error: Incompatible types in assignment (expression has type "Tuple[Any, ...]", variable has type "Union[Tuple[int], int, None]")

You might want to create a new variable like _chunks: Tuple[int] = ... then the "compiler" should know what you're trying to do.
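Both fixes can be sketched in a few lines; the names narrow_optional and normalize_chunks below are made up for illustration, only the typing pattern matters:

```python
from typing import Optional, Tuple, Union


def narrow_optional(value: Optional[str]) -> str:
    # Error 1: an Optional[...] cannot be passed where the plain type
    # is expected. Narrowing with an explicit None check satisfies mypy.
    if value is None:
        raise ValueError("expected a value, got None")
    return value  # mypy now knows value is str


def normalize_chunks(
    chunks: Union[Tuple[int, ...], int, None], ndim: int
) -> Tuple[int, ...]:
    # Error 2: assigning a Tuple[Any, ...] back into the Union-typed
    # parameter confuses the checker; a fresh, explicitly typed local
    # variable makes the intent clear.
    _chunks: Tuple[int, ...]
    if chunks is None:
        _chunks = (64,) * ndim
    elif isinstance(chunks, int):
        _chunks = (chunks,) * ndim
    else:
        _chunks = tuple(chunks)
    return _chunks
```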

@glyg
Contributor Author

glyg commented Dec 11, 2020

  • I had to tell the OMERO spec to put the data in the node at read time; I'm not sure it doesn't break anything (although the tests pass)

  • Is a class needed for the writer?

  • random channel colors might be ugly

@joshmoore
Member

random channel colors might be ugly

Seems like that can be handled in the future.

Is a class needed for the writer?

I was wondering about that. I don't think it's a MUST. Did you find any state that you would have felt useful to keep in such an object?

I had to tell the OMERO spec to put the data in the node at read time, I'm not sure it does not break anything

I'll take a look.

@joshmoore
Member

I had to tell the OMERO spec to put the data in the node at read time,

Do you think this is a problem with the order of operations? i.e. if the data nodes ran before the metadata nodes?

@glyg
Contributor Author

glyg commented Dec 16, 2020

Hi Josh,

Do you think this is a problem with the order of operations?

Yes, it is not clear to me when data is appended to node.data. If you have pyramid images, then the Multiscale spec will access the data, but the OMERO spec will also be present. If it's always accessed last, then testing if node.data is empty as I did should be enough?

Did you find any state that you would have felt useful to keep in such an object?

The scenario I was thinking of that might justify a Writer class instance is needing to append children / nodes at data generation time, e.g. during the course of an acquisition. Then maybe some kind of object persistence would be interesting.

Member

@joshmoore joshmoore left a comment

Hi @glyg. I just tried this out while trying to come up with an example for @oeway, do you mind if I push a few commits for your review? Alternatively, do you have any changes that are coming yourself?



def write_image(
    path: str,
Member

I could imagine passing a zarr.Group here rather than path+name+group

Contributor Author

Yes, that makes sense (I realize the group argument here is not used).
Do you think both ways of specifying the output should co-exist, or is it better to always leave the job of creating the group to the user?

Member

Let's start with leaving it to the user since there are so many options (local v. remote, etc.) but if we realize we're doing the same thing over and over again, we'll provide a helper.

Contributor Author

@glyg glyg Jan 11, 2021

sounds reasonable :)

    image: np.ndarray,
    name: str = "0",
    group: str = None,
    chunks: Union[Tuple[int], int] = None,
Member

If this method is to handle downsampling internally, I wonder if we want to pass multiple chunk sizes or keep them consistent.

Contributor Author

Seems to me it's easier to keep the same chunks for all sample sizes. Or scale the number of chunks (maybe in the x, y dimensions) to have a consistent chunk size across scales? I'd assume the specified chunk size is for the 1:1 scale.

It sounds complicated to ask the user to input one chunk size for each scale; imho it should be as transparent as possible.

Maybe it will get clearer when write_multiscales is integrated here?

Member

Yeah, I had gotten that far ;) I think you're right that it's more difficult than one would want in general. Is it worth providing all the options, though? i.e.

Union[Tuple[Tuple[int]], Tuple[int], int]
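A sketch of what accepting all three forms might look like, normalized to one chunk tuple per pyramid level (chunks_per_level is a hypothetical helper, not part of the PR):

```python
from typing import Sequence, Tuple, Union

# One int, one tuple shared by all levels, or one tuple per level.
ChunkSpec = Union[Sequence[Sequence[int]], Sequence[int], int]


def chunks_per_level(
    chunks: ChunkSpec, n_levels: int, ndim: int
) -> Tuple[Tuple[int, ...], ...]:
    """Expand a chunk spec into one chunk tuple per pyramid level."""
    if isinstance(chunks, int):
        # a single edge length, used for every dimension at every level
        return ((chunks,) * ndim,) * n_levels
    if isinstance(chunks[0], int):
        # a single tuple, reused for every level
        return (tuple(chunks),) * n_levels
    # already one tuple per level
    levels = tuple(tuple(c) for c in chunks)
    if len(levels) != n_levels:
        raise ValueError("need exactly one chunk tuple per pyramid level")
    return levels
```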

image = image.reshape(shape_5d)

if chunks is None:
    image = da.from_array(image)
Member

If no reshaping is necessary, it might be preferred to leave the input array as is and use a convenience zarr function to improve speed, but I'd need to do some testing.

Contributor Author

I'll let you be the judge of that :)

]
omero["rdefs"] = {"model": "color"}

metadata["omero"] = omero
Member

This is currently missing the multiscales metadata. I'd propose moving write_multiscales from ome_zarr.data to your writer module.
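For reference, the multiscales entry under discussion is a small block in the group's .zattrs with one datasets entry per resolution level; a minimal sketch following the 0.1 version of the spec in use at the time (paths illustrative):

```python
# Minimal "multiscales" attrs block, one "datasets" entry per pyramid
# level; version "0.1" matches the spec at the time of this PR.
paths = ["0", "1", "2"]  # one sub-array per resolution level
multiscales = [
    {
        "version": "0.1",
        "datasets": [{"path": p} for p in paths],
    }
]
# A writer would then set: group.attrs["multiscales"] = multiscales
```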

@glyg
Contributor Author

glyg commented Jan 8, 2021

Hi @glyg. I just tried this out while trying to come up with an example for @oeway, do you mind if I push a few commits for your review? Alternatively, do you have any changes that are coming yourself?

Hi @joshmoore of course feel free to push on this PR. I was actually waiting for your feedback before pursuing this, so there is nothing coming from me.
G.

@glyg
Contributor Author

glyg commented Jan 11, 2021

Hi @joshmoore - finally, should I apply the changes you suggest? I can do that today or early tomorrow.

Best,

Guillaume

@joshmoore
Member

Hi Guillaume, sorry, meetings have run long today. 🙂 Let me open a PR with my work-in-progress and you can give me feedback, trash it, etc. ~J.

Goal is to re-use write_multiscale previously from ome_zarr.data
and to remove hard-coding of the storage logic by passing a pre-
existing zarr.Group.
@@ -189,7 +193,7 @@ def get_json(self, subpath: str) -> JSONDict:
return {}


-def parse_url(path: str) -> Optional[BaseZarrLocation]:
+def parse_url(path: str, mode: str = "r") -> Optional[BaseZarrLocation]:
Contributor Author

@glyg glyg Jan 12, 2021

As this function is not used in the writer module, maybe I should revert the changes in io.py

Member

I was working on a simple example for Wei (his use case was "how to turn an existing zarr array into an ome-zarr"). Let's keep this until we have a simple working user example. Maybe parse_url can still be the starting point.
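Conceptually the change just threads a read/write mode flag from parse_url down to the location object; a simplified, self-contained sketch (this is not the real ome_zarr implementation, which also does format detection and more):

```python
from typing import Optional


class ZarrLocationSketch:
    """Stand-in for the *Location classes; name and fields assumed."""

    def __init__(self, path: str, mode: str = "r") -> None:
        self.path = path
        self.mode = mode  # "r" to read, "w"/"a" to allow writing


def parse_url(path: str, mode: str = "r") -> Optional[ZarrLocationSketch]:
    # Only the mode pass-through is shown here; the real parse_url
    # also decides between local and remote locations.
    if mode not in ("r", "w", "a"):
        raise ValueError(f"unsupported mode: {mode}")
    return ZarrLocationSketch(path, mode=mode)
```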

Previous example:
#!/usr/bin/env python
import numpy as np
import zarr as zr

from ome_zarr.io import parse_url
from ome_zarr.reader import OMERO, Reader
from ome_zarr.writer import write_image

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("path")
ns = parser.parse_args()

dtype = np.uint8
mean = 10
shape = (1, 3, 1, 256, 256)
rng = np.random.default_rng(0)
data = rng.poisson(mean, size=shape).astype(dtype)
store = zr.DirectoryStore(ns.path)
grp = zr.group(store)
write_image(data, grp, chunks=(128, 128))

* fix warning on mypy `exclude`
* fix .isort.cfg modification
* temporarily disable blind-except due to false positives
@codecov

codecov bot commented Feb 4, 2021

Codecov Report

Merging #69 (f22c564) into master (589cb52) will increase coverage by 0.48%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
+ Coverage   66.37%   66.85%   +0.48%     
==========================================
  Files          10       11       +1     
  Lines        1023     1056      +33     
==========================================
+ Hits          679      706      +27     
- Misses        344      350       +6     
Impacted Files Coverage Δ
ome_zarr/reader.py 58.86% <ø> (ø)
ome_zarr/io.py 77.77% <60.00%> (-2.06%) ⬇️
ome_zarr/writer.py 93.75% <93.75%> (ø)
ome_zarr/data.py 98.64% <100.00%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 729ab5b...f22c564. Read the comment docs.

@joshmoore
Member

@glyg very sorry to have let this slip. Getting back to my example for @oeway:

#!/usr/bin/env python
import numpy as np
import zarr as zr
from ome_zarr.writer import write_image

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("path")
ns = parser.parse_args()

# Set up your numpy, whatever it is
dtype = np.int8
mean = 1000
shape = (1, 3, 10, 4096, 4096)
rng = np.random.default_rng(0)
data = rng.poisson(mean, size=shape).astype(dtype)

# Write it out to a Zarr group
store = zr.DirectoryStore(ns.path)
grp = zr.group(store)
write_image(data, grp, chunks=(128, 128))

Looks like the only thing missing was the downsampling. I went ahead and added that in a commit. We may need to change the API slightly to let someone choose not to downsample, e.g. by passing scaler=None or by passing a pre-computed pyramid, but as far as I can tell, things are all working.

If my last commit goes green, let's get this in and start hammering on it!

Thanks for your patience. (Though feel free to kick me in the future.)

Member

@will-moore will-moore left a comment

Looks good. Running the tests locally and inspecting the output, everything looks as expected.
Happy for this to be merged, and I'll try using it 'for real'.

paths = []
for path, dataset in enumerate(pyramid):
    # TODO: chunks here could be different per layer
    group.create_dataset(str(path), data=pyramid[path], chunks=chunks)
Member

The dataset variable is not used; perhaps you want to do data=dataset?

chunks = _retuple(chunks, shape_5d)
omero = metadata.get("omero", {})
# Update the size entry anyway
omero["size"] = {
Member

I don't think we need to have size metadata here since we have that info in the .zarray and it's not part of the OME-Zarr spec.

omero["channels"] = [
    {
        "color": "".join(f"{i:02x}" for i in color),
        "window": {"start": 0, "end": 1},
Member

It's probably better to omit all the omero['channels'] metadata if we're not provided with it.
Generating these numbers will cause viewers to assume they are based on the data, which could lead to poor visualisation (e.g. render levels 0 - 1 will likely be saturated)

Member

@glyg : any thoughts from your initial use case?

Contributor Author

I think I just tried to emulate the code I saw in the tests or elsewhere; I agree that by default the metadata entries should be as small as possible.

@will-moore
Member

In testing this, I adapted omero-cli-zarr to use it for writing - See ome/omero-cli-zarr#58.
Not sure we want to use this approach (see PR) but it seems to be working.

except Exception as e:
    raise e
Contributor Author

Should this be there? The LOGGER line below will never be called

Member

Pretty likely debugging code that crept in. Thanks, @glyg
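For reference, replacing except Exception as e: raise e with a log-then-bare-raise keeps the LOGGER line reachable and preserves the original traceback; a small sketch (write_safely is a made-up name):

```python
import logging

LOGGER = logging.getLogger(__name__)


def write_safely(write, data):
    # ``except Exception as e: raise e`` would skip everything after
    # it; logging first and re-raising with a bare ``raise`` keeps the
    # log line reachable and preserves the original traceback.
    try:
        return write(data)
    except Exception:
        LOGGER.exception("failed to write %r", data)
        raise
```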

else:
    _chunks = chunks

return (*shape[: (5 - len(_chunks))], *_chunks)
Contributor Author

This is all based on the assumption that we have 5D data, which is being relaxed in ome/ngff#39. Should there be a default retuple to 5D, or no attempt at all to reshape the data?

Member

Good point, but let's handle that along with #39. Ultimately I think the rules will have to be laid out, the most flexible being: once the shape is defined (e.g. len(shape)==4), then everything else (e.g. chunks) should stay in that shape & order.
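The quoted _retuple tail can be read as: pad a shorter chunk tuple on the left with the array's own dimension sizes. A hedged, standalone sketch of that rule (hard-coding nothing to 5D):

```python
from typing import Tuple


def retuple(chunks: Tuple[int, ...], shape: Tuple[int, ...]) -> Tuple[int, ...]:
    # Pad ``chunks`` on the left with the array's own sizes so the
    # result has one entry per dimension of ``shape``; once the data's
    # dimensionality is fixed, chunks follow the same shape and order.
    if len(chunks) > len(shape):
        raise ValueError("more chunk entries than dimensions")
    return (*shape[: len(shape) - len(chunks)], *chunks)
```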

Contributor Author

Ok great!

@joshmoore
Member

@will-moore @glyg : pushed a few fixes:

  • Remove the metadata as agreed
  • Fixed the raise e issue
  • Allowed scaler to be set to None

A few things that I think need to happen, but can be outside this PR:

  • Pyramids should be pre-computable, but that conflicts with the dimension detection, so we can do that with the other dimension fixes.
  • I think the Scaler class needs a __call__ method which defaults to using nearest() internally. So then a scaler is basically just a function which takes an array and generates a pyramid.
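The __call__ idea from the last bullet can be sketched as follows; this toy Scaler subsamples with numpy striding purely for illustration (the real ome_zarr Scaler resizes properly and has more options):

```python
import numpy as np


class Scaler:
    """Toy callable scaler: array in, pyramid (list of arrays) out."""

    def __init__(self, downscale: int = 2, max_layer: int = 4) -> None:
        self.downscale = downscale
        self.max_layer = max_layer

    def nearest(self, base: np.ndarray) -> list:
        # Nearest-neighbour by striding the trailing (y, x) axes;
        # purely illustrative, real downsampling resizes properly.
        pyramid = [base]
        for _ in range(self.max_layer):
            pyramid.append(pyramid[-1][..., :: self.downscale, :: self.downscale])
        return pyramid

    def __call__(self, base: np.ndarray) -> list:
        # Defaulting __call__ to nearest() means any scaler is just a
        # function from an array to a pyramid, so an API could accept
        # scaler=None, a Scaler(), or any plain callable.
        return self.nearest(base)


pyramid = Scaler(max_layer=2)(np.zeros((1, 1, 1, 64, 64)))
```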

@will-moore
Member

@joshmoore You said "pushed a few fixes", but the last commit I see is b832df5 (1 commit yesterday)?
Am I missing some?

@joshmoore
Member

Doh! Sorry, @will-moore. I've made that mistake several times now. Pushed to my writer branch, rather than @glyg's. Commits are there now.

@will-moore
Member

Looks good.
Tests are passing etc.
Tried with/without omero metadata and with scaler=None from my omero-cli-zarr PR above. All works as expected.

@joshmoore
Member

Whew. Merging 🎉 Thanks for your effort (& patience!) here, @glyg.

@joshmoore joshmoore merged commit c3f641d into ome:master Mar 24, 2021
@glyg
Contributor Author

glyg commented Mar 25, 2021

Thanks a lot @joshmoore and @will-moore, very happy to be of some help :)
