Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blosc compressor: numThreads serialization fix #15

Merged
merged 6 commits into from
Oct 31, 2023

Conversation

sbesson
Copy link

@sbesson sbesson commented Sep 25, 2023

Fixes #14

As described in the accompanying issues, changes in 0.4.0 are causing the addition of the key numThreads to the compression map under .zarray when using blosc.

This serialization change causes compatibility issues with the expectations of numcodecs - see https://numcodecs.readthedocs.io/en/stable/blosc.html#numcodecs.blosc.Blosc. A formal specification of the blosc codec and its supported configuration values is available in the Zarr v3 specification https://zarr-specs.readthedocs.io/en/latest/v3/codecs/blosc/v1.0.html#blosc-codec-version-1-0. Although it's not 100% clear whether this dictionary should be considered as extensible, the number of threads used for compression is a writing concern which is fully independent of the reading/decompression mechanism (which might very well used different number of threads) so it seems incorrect to serialize it in the first place.

Tracking down the source of the issue with @melissalinkert, it was found to be introduced in #4 more specifically via the getNumThreads getter method which seems to be serialized under .zarray through the jackson-databind ObjectMapper API. 43cf7f3 proposes to remove the getter which suffices to fix the issue while retaining the initial feature.

Opening for initial feedback, it is pretty clear that both the feature and the blosc serialization logic were missing some minimal unit tests which would be great to introduce as part of this PR so that we don't create inadvertent regressions in the future.

This API has the side-effect of serialization the number of threads
under the numThreads key causing compatibility issues with other
libraries like numcodecs
Copy link
Member

@melissalinkert melissalinkert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected when combined with glencoesoftware/bioformats2raw#203; .zarray no longer contains a numThreads (or nthreads).

One other option might be to keep the getter, and annotate it with @JsonIgnore (https://github.com/FasterXML/jackson-annotations/wiki/Jackson-Annotations#property-inclusion). That seemed to work locally at least (but agreed that more unit tests would be helpful in any case).

@joshmoore
Copy link
Member

  • 👍 for the fix (Thanks, @sbesson!)
  • 👍 for the annotation if it works (Thanks, @melissalinkert) so that downstream users don't break (cc: @pedson) though I am confused why this didn't show up for them...

@sbesson
Copy link
Author

sbesson commented Sep 26, 2023

though I am confused why this didn't show up for them...

As I was wondering why this issue hadn't been reported earlier, I found out the compression constructors are not as strict as in numcodecs. At present, any extra key/value can be passed to the factory method/constructor:

Compressor compressor = CompressorFactory.create("zlib", "level", 4, "foo", "bar");

Support or not for unspecified key/value pairs in the compressor key of the JSON is likely a decision that should be enforced at the Zarr specification level. Based on my reading of the current specs,:

  • the Zarr v2 specification contains minimal information about the compression format and the supported codecs
  • the Zarr v3 specification substantially improves things by providing a official list of the supported codecs and individual specification pages - see https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html
  • neither specification defines whether extra parameters not specified in the Configuration Parameters section MAY, MIGHT NOT or MUST be stored under the compressor key

Given the impact, I would propose to focus the scope of this PR on fixing the serialization issue so that the latest version of the library generates Zarr arrays which metadata is compatible with the assumptions of the other reference libraries.

@sbesson
Copy link
Author

sbesson commented Sep 26, 2023

Last commits should implement @melissalinkert suggestion of using the @JsonIgnore annotation which is definitely more elegant than my removal. They also expand the scope of various unit tests to systematically test all core compressions and add a new unit test that specifically test the serialization/deserialization of blosc-compressed arrays.
a723e05 should fail on top of 0.4.0 but passes with the changes above.

While adding new classes came across some inconsistencies/nomenclature questions:

  • camel case vs snake case for new classes
  • boilerplate and generally copyright for this fork
  • the above points at some missing metadata e.g. organizationName and inceptionYear in the top-level POM in case we would like to update headers across the board

@joshmoore
Copy link
Member

@pedson: a heads up that once the header on the new (test) file is settled, we'll be getting this released unless you have a problem with it.

@joshmoore
Copy link
Member

Starting the 0.4.1 🚋

@joshmoore joshmoore merged commit e7be543 into zarr-developers:main Oct 31, 2023
2 checks passed
@sbesson sbesson deleted the numthreads_serialization_fix branch October 31, 2023 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected keyword argument 'numThreads' when reading Java Zarr with Python
3 participants