maxrjones
Member

@maxrjones maxrjones commented Aug 9, 2025

I think we need much more work around our handling of codecs (which would get much easier with zarr-developers/zarr-python#3276), but this is a more immediate fix to support parsers that specify any serializer other than the default (e.g., big-endian bytes).

This makes your NetCDF3 example work @rsignell

@rsignell
Collaborator

rsignell commented Aug 10, 2025

@maxrjones I ran my workflow and still got gibberish: https://nbviewer.org/gist/rsignell/32f563cfbe83c20ac01aba00190e5912
Did I need to update some other stuff in addition to virtualizarr?
Or perhaps more likely I did something else wrong. 😕

@maxrjones
Member Author

> @maxrjones I ran my workflow and still got gibberish: nbviewer.org/gist/rsignell/32f563cfbe83c20ac01aba00190e5912 Did I need to update some other stuff in addition to virtualizarr? Or perhaps more likely I did something else wrong. 😕

Yeah, this fix is only for Icechunk. Kerchunk uses Zarr format 2, so it will require either a separate fix or adapting our Kerchunk writer to use Zarr format 3. I'd be tempted to go with the second option because converting from Zarr V3 to V2 for Kerchunk adds a lot of surface area for potential bugs, but that would be a separate PR.
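
For context, a rough sketch of where byte order is recorded in each format, assuming a big-endian float32 array. The dictionaries are hand-written metadata fragments, not output from VirtualiZarr or Kerchunk, and `v2_dtype_endianness` is a hypothetical helper introduced only for illustration:

```python
from typing import Optional

# Zarr format 2: byte order is part of the NumPy-style dtype string.
v2_fragment = {"zarr_format": 2, "dtype": ">f4"}

# Zarr format 3: the data type carries no byte order; it is configured
# on the array->bytes codec (the "serializer") instead.
v3_fragment = {
    "zarr_format": 3,
    "data_type": "float32",
    "codecs": [{"name": "bytes", "configuration": {"endian": "big"}}],
}


def v2_dtype_endianness(dtype: str) -> Optional[str]:
    """Hypothetical helper: map a v2 dtype prefix to a v3 'endian' value."""
    return {">": "big", "<": "little"}.get(dtype[0])


# Both fragments describe the same byte order, stored in different places.
v3_endian = v3_fragment["codecs"][0]["configuration"]["endian"]
print(v2_dtype_endianness(v2_fragment["dtype"]), v3_endian)
```

Because the information lives in different places, a fix on the V3 serializer path does not automatically carry over to a V2 writer.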

@maxrjones maxrjones marked this pull request as ready for review August 11, 2025 22:20
@maxrjones maxrjones requested a review from a team August 11, 2025 22:26
@rsignell
Collaborator

@maxrjones cool, I wanted to use icechunk anyway! And as you said, it does solve my use case:
https://nbviewer.org/gist/rsignell/480cf6ac5142ce2b828199bcc601a93e

Member

@TomNicholas TomNicholas left a comment

So to check my understanding, the reason this failed was because the default array->bytes codec (i.e. the "serializer") in zarr v3 terms was for little endian? Which is fine until you use it to decode big-endian data, when it will give you gibberish?

@maxrjones
Member Author

> So to check my understanding, the reason this failed was because the default array->bytes codec (i.e. the "serializer") in zarr v3 terms was for little endian? Which is fine until you use it to decode big-endian data, when it will give you gibberish?

That's right.
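
As a minimal illustration of that failure mode, here is a sketch using only Python's standard `struct` module rather than any Zarr or VirtualiZarr code. It packs three int32 values in big-endian order, as a NetCDF3 file would store them, and then decodes the same bytes as little-endian:

```python
import struct

values = (1, 2, 3)

# Serialize as big-endian int32, the byte order NetCDF3 uses on disk.
raw = struct.pack(">3i", *values)

# Decoding with the matching byte order recovers the data.
decoded_big = struct.unpack(">3i", raw)

# Decoding with a little-endian assumption reverses the bytes of each value.
decoded_little = struct.unpack("<3i", raw)

print(decoded_big)     # (1, 2, 3)
print(decoded_little)  # (16777216, 33554432, 50331648)
```

No error is raised in the mismatched case, which is why the symptom is silently wrong values rather than a failure.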

We still likely have bugs because a user can define their own codec that doesn't subclass one of the Zarr-Python codec base classes. In that case, their custom codec will be dropped by

def extract_codecs(
    codecs: CodecPipeline,
) -> DeconstructedCodecPipeline:
    """Extracts various codec types."""
    arrayarray_codecs: tuple[ArrayArrayCodec, ...] = ()
    arraybytes_codec: ArrayBytesCodec | None = None
    bytesbytes_codecs: tuple[BytesBytesCodec, ...] = ()
    for codec in codecs:
        if isinstance(codec, ArrayArrayCodec):
            arrayarray_codecs += (codec,)
        if isinstance(codec, ArrayBytesCodec):
            arraybytes_codec = codec
        if isinstance(codec, BytesBytesCodec):
            bytesbytes_codecs += (codec,)
    return (arrayarray_codecs, arraybytes_codec, bytesbytes_codecs)

if they use create_v3_array_metadata. This may be the cause of #770.

I expect a lot of our codec handling will get much easier with @d-v-b's refactor in Zarr-Python, which is why I have been proposing small patch fixes rather than more fundamental changes.

@TomNicholas
Member

> their custom codec will be dropped by

Could we at least raise in that scenario?

Member

@TomNicholas TomNicholas left a comment

Add a changelog entry and this is good to go, thank you!

@maxrjones
Member Author

> > their custom codec will be dropped by
>
> Could we at least raise in that scenario?

yeah we definitely should raise in that scenario.
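
One possible shape for that check, as a sketch only and not the change made in this PR. The three base classes below are hypothetical stand-ins for the Zarr-Python codec base classes so the example runs without zarr installed, and `extract_codecs_strict` is an illustrative name:

```python
from typing import Iterable


# Hypothetical stand-ins for the Zarr-Python codec base classes.
class ArrayArrayCodec: ...
class ArrayBytesCodec: ...
class BytesBytesCodec: ...


def extract_codecs_strict(codecs: Iterable[object]):
    """Split codecs by type, raising instead of silently dropping unknowns."""
    arrayarray_codecs: tuple = ()
    arraybytes_codec = None
    bytesbytes_codecs: tuple = ()
    for codec in codecs:
        if isinstance(codec, ArrayArrayCodec):
            arrayarray_codecs += (codec,)
        elif isinstance(codec, ArrayBytesCodec):
            arraybytes_codec = codec
        elif isinstance(codec, BytesBytesCodec):
            bytesbytes_codecs += (codec,)
        else:
            # A codec that matches none of the known base classes would
            # otherwise vanish from the returned pipeline.
            raise TypeError(
                f"Unrecognized codec type {type(codec).__name__!r}; "
                "expected an array-array, array-bytes, or bytes-bytes codec."
            )
    return (arrayarray_codecs, arraybytes_codec, bytesbytes_codecs)
```

With this shape, a custom codec that does not inherit from one of the recognized base classes produces a `TypeError` at extraction time instead of a pipeline that decodes incorrectly later.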

@TomNicholas TomNicholas added the Icechunk 🧊 Relates to Icechunk library / spec label Aug 13, 2025
@TomNicholas TomNicholas merged commit a8e0ac7 into zarr-developers:main Aug 13, 2025
13 checks passed
Development

Successfully merging this pull request may close these issues.

ArrayBytesCodec not included in Icechunk serialization

3 participants