Blosc compressor: numThreads serialization fix #15

sbesson · 2023-09-25T16:24:10Z

Fixes #14

As described in the accompanying issues, changes in 0.4.0 add a numThreads key to the compressor map in .zarray when using blosc.

This serialization change conflicts with the expectations of numcodecs - see https://numcodecs.readthedocs.io/en/stable/blosc.html#numcodecs.blosc.Blosc. A formal specification of the blosc codec and its supported configuration values is available in the Zarr v3 specification https://zarr-specs.readthedocs.io/en/latest/v3/codecs/blosc/v1.0.html#blosc-codec-version-1-0. Although it's not 100% clear whether this dictionary should be considered extensible, the number of threads used for compression is a writing concern which is fully independent of the reading/decompression mechanism (which might very well use a different number of threads), so it seems incorrect to serialize it in the first place.

Tracking down the source of the issue with @melissalinkert, it was found to be introduced in #4, more specifically via the getNumThreads getter method, which seems to be serialized under .zarray through the jackson-databind ObjectMapper API. 43cf7f3 proposes to remove the getter, which suffices to fix the issue while retaining the initial feature.
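To illustrate why a public getter alone can end up in the serialized metadata, here is a minimal stdlib-only sketch of getter-based property discovery. It is a simplified approximation of the bean naming convention that mappers such as Jackson follow, not Jackson's actual implementation; the classes `WithGetter` and `WithoutGetter` and the helper `propertyNames` are hypothetical names introduced for this example.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

class GetterDiscovery {
    // Hypothetical stand-in for a compressor exposing its thread count via a public getter.
    static class WithGetter {
        private int numThreads = 1;
        public String getCname() { return "lz4"; }
        public int getNumThreads() { return numThreads; }
    }

    // Same state, but no bean-style getter for the thread count.
    static class WithoutGetter {
        private int numThreads = 1;
        public String getCname() { return "lz4"; }
    }

    // Simplified getter detection (an assumption, not Jackson's real rules):
    // public, non-static, zero-argument, non-void methods named getXxx.
    static Set<String> propertyNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            String n = m.getName();
            boolean getter = Modifier.isPublic(m.getModifiers())
                    && !Modifier.isStatic(m.getModifiers())
                    && m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && n.startsWith("get") && n.length() > 3;
            if (getter) {
                // getNumThreads -> numThreads
                names.add(Character.toLowerCase(n.charAt(3)) + n.substring(4));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(propertyNames(WithGetter.class));
        System.out.println(propertyNames(WithoutGetter.class));
    }
}
```

Under this convention the private field is irrelevant: the property appears as soon as a matching public getter exists, which is consistent with removing the getter being enough to drop the key.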

Opening for initial feedback. It is pretty clear that both the feature and the blosc serialization logic were missing some minimal unit tests, which would be great to introduce as part of this PR so that we don't create inadvertent regressions in the future.

This API has the side effect of serializing the number of threads under the numThreads key, causing compatibility issues with other libraries like numcodecs.

melissalinkert

Works as expected when combined with glencoesoftware/bioformats2raw#203; .zarray no longer contains a numThreads (or nthreads).

One other option might be to keep the getter, and annotate it with @JsonIgnore (https://github.com/FasterXML/jackson-annotations/wiki/Jackson-Annotations#property-inclusion). That seemed to work locally at least (but agreed that more unit tests would be helpful in any case).
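The idea of keeping the getter while hiding it from the writer can be sketched without any dependency. The example below uses a hypothetical runtime annotation `@SkipOnWrite` as a stand-in for Jackson's `@JsonIgnore`, together with a hypothetical `toMap` helper and `BloscLike` class; with Jackson itself, the equivalent would be placing `@JsonIgnore` on `getNumThreads()`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class IgnoreAnnotationSketch {
    // Hypothetical marker playing the role of Jackson's @JsonIgnore.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SkipOnWrite {}

    // Hypothetical compressor: the getter stays public for callers
    // but is hidden from the writer.
    static class BloscLike {
        public String getCname() { return "lz4"; }
        public int getClevel() { return 5; }

        @SkipOnWrite
        public int getNumThreads() { return 4; }
    }

    // Collects getXxx() values into a sorted map, skipping annotated getters.
    static Map<String, Object> toMap(Object bean) {
        Map<String, Object> out = new TreeMap<>();
        for (Method m : bean.getClass().getDeclaredMethods()) {
            String n = m.getName();
            if (!n.startsWith("get") || n.length() <= 3 || m.getParameterCount() != 0) {
                continue; // not a bean-style getter
            }
            if (m.isAnnotationPresent(SkipOnWrite.class)) {
                continue; // explicitly excluded from the written metadata
            }
            try {
                out.put(Character.toLowerCase(n.charAt(3)) + n.substring(4), m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + n, e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BloscLike c = new BloscLike();
        System.out.println(toMap(c));            // written properties only
        System.out.println(c.getNumThreads());   // getter still usable by callers
    }
}
```

The design trade-off is that downstream code calling the getter keeps compiling, while the written metadata no longer carries the writer-side setting.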

joshmoore · 2023-09-25T17:55:07Z

👍 for the fix (Thanks, @sbesson!)
👍 for the annotation if it works (Thanks, @melissalinkert) so that downstream users don't break (cc: @pedson) though I am confused why this didn't show up for them...

sbesson · 2023-09-26T08:17:07Z

though I am confused why this didn't show up for them...

As I was wondering why this issue hadn't been reported earlier, I found out the compression constructors are not as strict as in numcodecs. At present, any extra key/value can be passed to the factory method/constructor:

Compressor compressor = CompressorFactory.create("zlib", "level", 4, "foo", "bar");
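For contrast, a stricter factory could reject keys it does not know about, which is roughly the behaviour that surfaces on the Python side as an unexpected keyword argument. The following is only a sketch under assumptions: `StrictCompressorConfig`, `parse` and the `BLOSC_KEYS` set are hypothetical names, and the key set is an illustrative subset rather than an authoritative list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class StrictCompressorConfig {
    // Illustrative subset of blosc parameters (assumption, not an authoritative list).
    static final Set<String> BLOSC_KEYS =
            new TreeSet<>(Arrays.asList("cname", "clevel", "shuffle", "blocksize"));

    // Builds a config map from alternating key/value arguments,
    // rejecting any key that is not in the allowed set.
    static Map<String, Object> parse(Set<String> allowed, Object... keyValuePairs) {
        if (keyValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of key/value arguments");
        }
        Map<String, Object> config = new LinkedHashMap<>();
        for (int i = 0; i < keyValuePairs.length; i += 2) {
            String key = String.valueOf(keyValuePairs[i]);
            if (!allowed.contains(key)) {
                throw new IllegalArgumentException("Unexpected parameter: " + key);
            }
            config.put(key, keyValuePairs[i + 1]);
        }
        return config;
    }

    public static void main(String[] args) {
        System.out.println(parse(BLOSC_KEYS, "cname", "lz4", "clevel", 5));
        try {
            parse(BLOSC_KEYS, "clevel", 5, "numThreads", 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether such strictness belongs in the library is exactly the open question here; the sketch only shows what enforcing it would look like.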

Whether unspecified key/value pairs are supported in the compressor key of the JSON is likely a decision that should be enforced at the Zarr specification level. Based on my reading of the current specs:

the Zarr v2 specification contains minimal information about the compression format and the supported codecs
the Zarr v3 specification substantially improves things by providing an official list of the supported codecs and individual specification pages - see https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html
neither specification defines whether extra parameters not specified in the Configuration Parameters section MAY, MIGHT NOT or MUST be stored under the compressor key

Given the impact, I would propose to limit the scope of this PR to fixing the serialization issue so that the latest version of the library generates Zarr arrays whose metadata is compatible with the assumptions of the other reference libraries.

sbesson · 2023-09-26T14:42:56Z

The last commits should implement @melissalinkert's suggestion of using the @JsonIgnore annotation, which is definitely more elegant than my removal. They also expand the scope of various unit tests to systematically test all core compressions and add a new unit test that specifically tests the serialization/deserialization of blosc-compressed arrays.
a723e05 should fail on top of 0.4.0 but passes with the changes above.

While adding new classes, I came across some inconsistencies/nomenclature questions:

camel case vs snake case for new classes
boilerplate and generally copyright for this fork
the above points at some missing metadata e.g. organizationName and inceptionYear in the top-level POM in case we would like to update headers across the board

joshmoore · 2023-10-02T13:16:35Z

@pedson: a heads up that once the header on the new (test) file is settled, we'll be getting this released unless you have a problem with it.

joshmoore · 2023-10-31T08:21:59Z

Starting the 0.4.1 🚋

sbesson added 2 commits September 25, 2023 16:57

Bump version to 0.4.1-SNAPSHOT

811d5ca

Remove getter for number of threads from Compressor

43cf7f3

This API has the side effect of serializing the number of threads under the numThreads key, causing compatibility issues with other libraries like numcodecs.

sbesson requested review from joshmoore and melissalinkert September 25, 2023 16:24

melissalinkert reviewed Sep 25, 2023

sbesson added 4 commits September 26, 2023 12:09

CompressorFactoryTest: add tests for blosc configuration parameters

a2f0b59

ZarrArrayDataReaderWriterTest_2D: test all compressions

9424a68

Add new tests for compression serialization/deserialization

a723e05

Restore getNumThreads public getter and decorate with @JsonIgnore

f7f35f9

melissalinkert approved these changes Sep 26, 2023

joshmoore mentioned this pull request Oct 31, 2023

Unexpected keyword argument 'numThreads' when reading Java Zarr with Python #14

Closed

joshmoore merged commit e7be543 into zarr-developers:main Oct 31, 2023
2 checks passed

sbesson deleted the numthreads_serialization_fix branch October 31, 2023 08:27

joshmoore mentioned this pull request Oct 31, 2023

Bump dev.zarr:jzarr to 0.4.2 ome/ZarrReader#66

Merged
