ZFP Compression #117
It looks like there are already some python/numpy bindings: https://github.com/seung-lab/fpzip |
Hi Ryan, to implement a new codec you just need to implement the
numcodecs.abc.Codec interface:
https://numcodecs.readthedocs.io/en/latest/abc.html
...then register your new codec class with a call to register_codec():
https://numcodecs.readthedocs.io/en/latest/registry.html#numcodecs.registry.register_codec
If you want to just try this out as an experiment, you could knock up a codec implementation in a notebook or wherever; using the Python bindings for zfp, the codec implementation would be very simple. I'd suggest getting some benchmark results showing useful speed and/or compression ratio, then considering adding it to numcodecs if it looks promising.
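For concreteness, a minimal sketch of such an experimental codec is shown below. It assumes the fpzip bindings' compress()/decompress() functions discussed in this thread; the precision keyword and the out-buffer handling are illustrative, not a finished implementation.

```python
# Rough sketch only -- an experimental fpzip-backed codec, not final numcodecs code.
import numpy as np
import fpzip  # https://github.com/seung-lab/fpzip
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class FpzipExperiment(Codec):
    """Hypothetical codec wrapping the fpzip Python bindings."""

    codec_id = 'fpzip-experiment'

    def __init__(self, precision=0):
        # precision=0 is assumed to mean lossless here
        self.precision = precision

    def encode(self, buf):
        # fpzip compresses floating-point arrays; ensure a contiguous ndarray
        arr = np.ascontiguousarray(buf)
        return fpzip.compress(arr, precision=self.precision)

    def decode(self, buf, out=None):
        dec = fpzip.decompress(bytes(buf))
        if out is not None:
            # simplified: copy decompressed values into the caller's buffer
            out_view = np.frombuffer(out, dtype=dec.dtype)
            out_view[:] = dec.reshape(-1)
            return out
        return dec


register_codec(FpzipExperiment)
```

Once registered, a class like this could be benchmarked on a few representative chunks before deciding whether it belongs in numcodecs proper.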
If this did look like a useful addition to numcodecs, there are a couple of things I noticed from a quick glance at the source for the Python bindings. First, it expects arrays of at least 3 and at most 4 dimensions; I don't know whether this is a constraint that might be good to relax. Also, it converts data to Fortran order before compression, which will mean an extra data copy for most users, since data is usually in C order. This may be a hard requirement of the zfp library, and so unavoidable, but it's something to note. |
Digging deeper, it looks like fpzip does not support missing data: https://zfp.readthedocs.io/en/release0.5.4/directions.html#directions This is a dealbreaker for now (at least for me), but it looks like it might be addressed soon. |
How do you handle missing data with Zarr currently? |
@rabernat I'm the author of the fpzip python bindings. I think you may be conflating zfp and fpzip. fpzip is a lossless codec while zfp isn't. Try the following:

```python
import fpzip
import numpy as np

x = np.array([[[ np.nan ]]], dtype=np.float32)
y = fpzip.compress(x)
z = fpzip.decompress(y)
print(z)
```

Worth noting that fpzip is for 3D and 4D data, though 1D and 2D data compression is supported via adding dimensions of size 1. |
Also, if you guys find fpzip interesting, I'd be happy to add more flexible support for Fortran vs C order. I don't think it's a hard constraint; it's just what I got working for my use case. UPDATE: fpzip 1.1.0 respects C order arrays. |
@william-silversmith, thanks for clearing this up. I'm going to blame the mix-up on @jhamman with this comment. Thanks also for writing fpzip and sharing it with the community!

I did some benchmarking of fpzip vs. zstd on real ocean data. You can find my full analysis in this notebook. The results are summarized by these two figures, which compare the compression ratio vs. encoding / decoding time of the two codecs on two different ocean datasets, one with a land mask (i.e. nan's) over about 43% of the domain and one without.

It's interesting to note how zstd finds the missing data (encoded as nans) immediately; the compression ratios on the data with the mask are much higher. Since fpzip doesn't allow nans, I just filled with zeros. With fpzip, there are only very minor compression ratio differences between the masked and the unmasked arrays.

Based on this analysis, in terms of decoding time (which is my main interest), fpzip is nowhere close to zstd. To get fpzip to speed up encoding or decoding, I have to go out to precisions of < 20, which results in acceptable losses.

The caveat is that this is all done on my laptop, so it might not be very robust or representative. Please have a look at my notebook and see if I have done anything obviously wrong. (It should be fully runnable from anywhere.) |
Also, in case this was not clear, I am basically not getting any compression of either dataset with fpzip using the default precision:

```python
>>> len(fpzip.compress(data_sst))/data_sst.nbytes
1.0000054012345678
>>> len(fpzip.compress(data_ssh))/data_ssh.nbytes
1.0000054012345678
```

Am I doing something wrong? |
Hi Ryan! Thanks for the info. I'm glad you found it useful to evaluate. I have a few comments that I'll just put into bullets:
I did a small test of some of the pangeo data using the following script and I think I must have injected a bug. I think the reason there's so little compression is because the tail of the compressed data are all zeros. Let me investigate this.... Apologies, this python library is pretty new and we haven't put it into production use yet so it's not 100% battle tested.

```python
import fpzip
import numpy as np
import gcsfs
import pandas as pd
import xarray as xr

gcs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon')
# ds_ssh = xr.open_zarr(gcsfs.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt',
#                                    gcs=gcs))
ds_llc_sst = xr.open_zarr(gcsfs.GCSMap('pangeo-data/llc4320_surface/SST',
                                       gcs=gcs), auto_chunk=False)
data = ds_llc_sst.SST[:5, 1, 800:800+720, 1350:1350+1440].values
x = fpzip.compress(data)
print(x)
```
|
I just updated fpzip to version 1.1.1, which should have the trailing zeros problem fixed. Give it a try again! EDIT: I did a quick test and I saw the following results for the above script:
|
@rabernat, I'm the developer of fpzip (and zfp). Having worked a lot with climate scientists, I would certainly expect high-resolution ocean data like yours to compress quite well (definitely to less than 100%!). @william-silversmith, who has done a great job adding a Python interface to fpzip, reports a reduction to 63% after his latest fix, which seems more plausible. Perhaps you can give fpzip another try? If you're willing to accept some loss, to within an acceptable error tolerance, then I would suggest checking out zfp. It's a far faster and more advanced compressor than fpzip. |
Hi @lindstro -- thanks for dropping in here. And most of all, thanks for your important work on compression! As a computational oceanographer, I recognize that this is really crucial for the future sustainability of high-resolution modeling.

I updated my analysis with the latest version of fpzip. Indeed, it actually compresses the data now! 😉 Here is a figure which summarizes the important results. In contrast to my previous figure, I have eliminated the

In terms of compression ratio, fpzip beats zstd when there is no land mask (this holds regardless of the zstd level). With a land mask, zstd is able to do a bit better. zstd is a lot faster, particularly on decoding, but that is not a dealbreaker for us. (50 MB/s is still fast compared to network transfer times.) Based on this analysis, I think we definitely want to add fpzip support to numcodecs.

Yes, we would love to try zfp. Is there a python wrapper for it yet? |
Another point about fpzip: compression is considerably better (0.4 vs. 0.55 compressed size relative to original) if I transpose the arrays from their native python / C order to Fortran order when I feed them in. @william-silversmith -- I notice you have added the |
@alimanfoo - presuming we want to move forward with adding fpzip and zfp to numcodecs, what would be the best path? Would you want to use @william-silversmith's python package, or would you want to re-wrap the C code within numcodecs, as currently done for c-blosc? The latter approach would also allow us to use zfp without an independent python implementation. But it is a lot more work. Certainly not something I can volunteer for. |
If there are existing python wrappers then I'd suggest using those, at least as a first pass - we can always optimise later if there is room for improvement. PRs to numcodecs for fpzip and zfp would be welcome. |
@rabernat, thanks for rerunning your compression study. fpzip was developed back in 2005-2006, when 5 MB/s was fast. I've been meaning to rewrite and parallelize it, but I've got my hands full with zfp development.

Regarding transposition, clearly we'd want to avoid making a copy. fpzip and zfp both use the convention that x varies faster than y, which varies faster than z. So an array of size nx * ny * nz should have a C layout `float array[nz][ny][nx];`.
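To make the layout convention concrete, here is a small illustrative check (not taken from the fpzip or zfp documentation) showing that a C-ordered NumPy array of shape (nz, ny, nx) already stores x fastest, so only the dimension order needs to be reported to the compressor:

```python
# Illustrative only: the last NumPy axis varies fastest in C order, matching
# fpzip/zfp's "x fastest" convention, so no physical transpose is required.
import numpy as np

nz, ny, nx = 8, 16, 32
a = np.arange(nz * ny * nx, dtype=np.float32).reshape(nz, ny, nx)  # C order

# strides in bytes: stepping along x (the last axis) moves 4 bytes, i.e. fastest
assert a.strides == (ny * nx * 4, nx * 4, 4)

# the dimensions passed to fpzip/zfp would then be (nx, ny, nz)
dims_for_compressor = a.shape[::-1]
print(dims_for_compressor)  # (32, 16, 8)
```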
@william-silversmith's Python wrappers should ideally do nothing but pass the array dimensions to fpzip in the "correct" order and not physically move any data. zfp (but not fpzip) also supports strided access to handle arrays of structs without having to make a copy. As far as Python wrappers for zfp, we're just now starting on that and have an end-of-the-year milestone to deliver such a capability. The developer working on this has suggested using cffi. As I'm virtually Python illiterate, I'm not sure how that would interact with numcodecs. Any suggestions are welcome. |
For numcodecs it doesn't matter how you wrap zfp, as long as the zfp python module provides a compress() function that accepts any python object that implements the buffer protocol and returns a python object that implements the buffer protocol (e.g., bytes), and similar for decompress(). In both cases, memory copies should be minimised, i.e., the buffer protocol should be used to read data directly from the buffers exposed by the python objects. |
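As an illustration of that contract (zlib is used purely as a stand-in for whatever compression library sits underneath; it is not the proposed zfp binding):

```python
# Sketch of the buffer-protocol shape numcodecs expects from a compression module.
import zlib

import numpy as np


def compress(buf):
    # accept any object exposing the buffer protocol; read it without copying
    mem = memoryview(buf).cast('B')
    return zlib.compress(mem)  # returns bytes, which also exposes the buffer protocol


def decompress(buf):
    return zlib.decompress(memoryview(buf).cast('B'))


data = np.linspace(0, 1, 1000, dtype='f4')
roundtrip = np.frombuffer(decompress(compress(data)), dtype='f4')
assert np.array_equal(data, roundtrip)
```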
Ok, so we are clearly in a "help wanted" situation. Implementing the necessary wrappers / documentation / tests for fpzip and zfp in numcodecs would make an excellent student / intern project. It is a clearly defined task with plenty of examples to draw from. And it will have a big impact by bringing these codecs to a broader community via zarr. So if anyone knows of any students or other suitable candidates looking to get their feet wet in open source, please send them to this issue! Also, I'm transferring this to the numcodecs issue tracker, where it clearly belongs. |
Just to elaborate a little, what we would like to do is implement a codec class for ZFP. The Codec interface is defined here. Once there is a Python package binding ZFP, then the codec implementation is very simple, basically the There is a detail on the One other detail, ideally a zfp Python binding would be able to accept any object exposing the Python buffer interface. We currently also require codecs to be able to handle Hth. |
We can smooth over this a bit. Opened PR ( #119 ) as a proof-of-concept. |
@alimanfoo, zfp does indeed decompress directly into a user-allocated buffer. Similarly, compression writes to a user-allocated buffer whose size zfp conservatively estimates for the user. Alternatively, the user can specify how much memory to use for the compressed data, and zfp will ensure that the compressed data fits within the buffer (with quality commensurate with buffer size). I know very little about zarr, but my understanding is that it partitions arrays into equal-shape chunks whose compressed storage is of variable length. I'm guessing chunks have to be large enough to amortize the overhead of compression metadata (I see 1 MB or so recommended). zfp provides similar functionality but uses very small chunks (4^d values for d-dimensional arrays) that are each (lossily) compressed to a fixed number of bits per chunk (the user specifies how many bits; for 3D arrays, 1024 bits is common). I wonder if this capability could be exploited in zarr without having to rely on zfp's streaming compression interface. |
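For a rough sense of those numbers, a back-of-the-envelope sketch based on the figures quoted above:

```python
# Illustrative arithmetic only, using the 4^d values per block and
# 1024 bits per 3D block mentioned in the comment above.
d = 3                                    # dimensionality
values_per_block = 4 ** d                # zfp blocks hold 4^d values
bits_per_block = 1024                    # a common choice for 3D arrays
bits_per_value = bits_per_block / values_per_block
print(bits_per_value)                    # 16.0 bits/value
print(bits_per_value / 32)               # 0.5 -> 2x compression vs. float32
print(bits_per_block / 8)                # 128 bytes per compressed block
```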
We've revamped how codecs handle data internally. Outlined in this comment. This should make it much easier to contribute new codecs. Would be great if those interested took a look and provided any feedback. |
@lindstro thanks for the information, very helpful. Apologies if I have not fully understood the details in your comment, but you are right to say that zarr partitions an array into equal-shape chunks, and then passes the data for each chunk to a compression codec (in fact this can be a pipeline of codecs configured by the user, but typically it is just a compressor). The result of encoding the chunk is then stored, and this will be of variable length depending on the data in each chunk.

In the zarr architecture, ZFP would be wrapped as a codec, which means it could be used as the compressor for an array. So zarr would pass each chunk in the array to ZFP, and then store whatever ZFP gives it back. Zarr passes all of the data for a chunk to a codec in a single API call, so in general there is no need to use streaming compression; you can just do a one-shot encoding. Typically I find that chunks should be at a minimum 1 MB uncompressed size; usually I find upwards of 16 MB is better, depending somewhat on various factors like compression ratio and type of storage being used.

A compressor (or any other type of codec) is a black box as far as zarr is concerned. If a compressor like ZFP chooses to break the chunk down further into smaller pieces, that is an internal implementation detail. E.g., the Blosc compressor does something similar: it breaks down whatever it is given into smaller blocks, so it can then use multiple threads and compress blocks in parallel. If it is possible to vary the size of chunks that ZFP is using internally, then this is an option you'd probably want to expose to the user when they instantiate the ZFP codec, so they could tune ZFP for a particular dataset. Hth. |
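To illustrate how a codec plugs in at the chunk level (using the existing Blosc codec as a stand-in, since a ZFP codec is what this thread is working toward; the array shape and parameters are arbitrary):

```python
# Illustrative only: zarr hands each chunk to the configured compressor codec
# and stores whatever encoded bytes come back. A ZFP codec would slot into the
# `compressor=` argument in exactly the same way.
import numpy as np
import zarr
from numcodecs import Blosc

z = zarr.zeros((4000, 4000), chunks=(1000, 1000), dtype='f4',
               compressor=Blosc(cname='zstd', clevel=5))
z[:] = np.random.default_rng(0).random((4000, 4000), dtype=np.float32)
print(z.info)  # reports chunk shape, number of chunks, and the storage ratio
```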
@alimanfoo, let me try to clarify. Whereas zarr prefers uncompressed chunks as large as 16 MB, zfp in its fixed-rate mode uses compressed chunks on the order of 8-128 bytes (a cache line or two), which provides for far finer granularity of access. Think of zfp as a compressed floating-point format for SSE vectors and similar. I was just thinking out loud whether a capability like that could be exploited by zarr, for example, if traversing a 3D array a 2D or 1D slice at a time, when only a very small subset of a 16 MB chunk is needed. My reference to zfp streaming compression was meant in the sense of sequentially (de)compressing an entire array (or large portions thereof) in contrast to zfp's inline (de)compression capability, where tiny blocks (4^d scalars in d dimensions) are (de)compressed on demand in response to random access reads or writes to individual array elements. Of course, one could use zfp as a black box codec with zarr to (de)compress large chunks, but my question had to do with whether zarr could benefit from fine-grained access to individual scalars or a few handfuls of scalars, as provided by zfp's own compressed arrays. If not, then the best approach to adding zfp support to zarr would most likely be to use the Python bindings we're currently developing to zfp's high-level C interface, which is designed for infrequent (de)compression of the megabyte-sized chunks that zarr prefers. |
Many thanks @lindstro, very nice clarification.
We had a similar discussion a while back, as the blosc compressor offers something analogous, which is the ability to decompress specific items within a compressed buffer, via a function called

The bottom line is that I think it could, in principle, be possible to modify numcodecs and zarr to leverage this kind of feature. However, it would require some reworking of the API layer between zarr and numcodecs and some non-trivial implementation work. I don't have bandwidth to do work in that direction at the moment, but if someone had the time and motivation then they'd be welcome AFAIC.

As I understand it, the type of use case that would make this feature valuable is where you are doing a lot of random access operations into small regions of an array, i.e., pulling out individual values or small clusters of nearby values. I don't have any use cases like that personally, but maybe others do, and if so I'd be interested to know. All my use cases involve running parallel computations over large regions of an array. For those use cases, it is fine to decompress entire chunks, as it is generally only a few chunks around the edge of an array or region where you don't need to decompress everything, and the overhead of doing a little more decompression than strictly needed is negligible. Occasionally I do need to pull out single values or a small sub-region, but that is generally a one-off, so again the extra overhead of decompressing entire chunks is easy to live with.
This sounds like a good place to start. |
Regarding use cases, you'll have to excuse my ignorance of how zarr works and the use cases it was designed for, but here are a few that motivated the design of zfp and its small blocks:
I do agree, however, that it is often possible to structure the traversal and computation in a streaming manner to amortize the latency associated with (de)compression. This, however, typically requires refactoring existing code. |
Thanks @lindstro, very interesting. FWIW Zarr was designed primarily as a storage partner to the Dask array module, which implements a subset of the Numpy interface but as chunked parallel computations. It's also designed to work well either on a single machine (either multi-threaded or multi-process) or on distributed clusters (e.g., Pangeo-like clusters reading from object storage). With Dask, the sweet spot tends to be towards larger chunks anyway, because there is some scheduling overhead associated with each task. So there has not (yet) been any pressure to optimise access to smaller array regions. But as I mentioned above it is conceivable that Zarr could be modified to leverage capabilities of ZFP or Blosc to decompress chunk sub-regions. So if someone has a compelling use case and wants to explore this then please feel free. |
@lindstro, one important point is that Zarr supports on-"disk" storage (though this could be an object store or actually files in directories) or in-memory storage. It's possible smaller chunk sizes could make sense for some in-memory Zarr Array use cases. The in-memory format is a dictionary, with each chunk stored as a different key-value pair. Fortunately, whatever work goes into adding a codec to Numcodecs could be leveraged easily by either use case. |
Let's just say it would be an interesting challenge 😄. If anyone does decide they want to explore this at some point, I'd recommend having a specific, concrete use case in mind to target and benchmark, including specific computations, specific data, and a specific target platform, including what type of storage, what type of computational nodes, and what (if any) type of parallelism. I'd also recommend making sure it's an important use case, i.e., optimisation would save somebody a significant amount of time and/or money; and doing enough thinking ahead of time to be fairly convinced that the desired performance improvement should in theory be achievable. These suggestions are obvious to everyone I'm sure, but just wouldn't want anyone to sink loads of time for little reward. |
I'm almost done with the zfp codec. I just want to check: zfp has four different modes, namely accuracy, precision, rate, and a combination mode. The combination mode can accept four parameters. Do I just pass them in the API as compress(buf, mode, accuracy, precision, rate, minbits, maxbits, maxprec, minexp)?

Thanks,
Haiying |
Normally those sorts of things go in the constructor and are set as members of the instance. Take a look at |
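A hedged sketch of that pattern, with parameter names taken from the question above (purely illustrative, not the final codec's API):

```python
# Sketch only: zfp's mode parameters live in the constructor and are stored as
# instance attributes; encode()/decode() would then read them from self.
from numcodecs.abc import Codec


class ZFPSketch(Codec):

    codec_id = 'zfp-sketch'  # hypothetical id

    def __init__(self, mode='accuracy', accuracy=1e-3, precision=16, rate=16,
                 minbits=0, maxbits=0, maxprec=0, minexp=0):
        self.mode = mode
        self.accuracy = accuracy
        self.precision = precision
        self.rate = rate
        self.minbits = minbits
        self.maxbits = maxbits
        self.maxprec = maxprec
        self.minexp = minexp

    def encode(self, buf):
        raise NotImplementedError  # a real implementation would dispatch on self.mode

    def decode(self, buf, out=None):
        raise NotImplementedError


# the default Codec.get_config() picks the instance attributes up automatically
print(ZFPSketch(mode='rate', rate=16).get_config())
```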
@halehawk - a great way to move forward with this would be to just submit a pull request with your code as is. Then we can discuss the details and iterate before merging. |
I did a pull request yesterday. |
Ah fantastic! Perhaps you could edit your PR description to include the text "Closes #117" in the description? That way it will be linked to this discussion. |
Do you mean I shall submit a new pull request? By the way, what is 'tox' in the to-do list? |
I just added closes #117 in the description. |
zfp 0.5.5 has been released, which adds Python bindings to the C API and the ability to (de)compress NumPy arrays. The current release supports compression of arbitrarily strided arrays (including non-contiguous arrays), but does not yet support decompression with arbitrary strides. This release also adds lossless compression to zfp. We are planning on expanding zfp's Python API over the next year or so, for instance, to include Python bindings to zfp's compressed arrays. |
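As a quick illustration of the new bindings (assuming the zfpy module's compress_numpy()/decompress_numpy() entry points; the tolerance value is arbitrary):

```python
# Round-trip sketch with the zfpy bindings in fixed-accuracy mode.
import numpy as np
import zfpy

data = np.random.default_rng(0).random((64, 64, 64))     # 3D float64 test array
compressed = zfpy.compress_numpy(data, tolerance=1e-4)   # lossy, fixed accuracy
restored = zfpy.decompress_numpy(compressed)

print(len(compressed) / data.nbytes)                     # compressed fraction
print(np.max(np.abs(restored - data)))                   # should be on the order of 1e-4 or less
```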
Update: zfpy version 0.5.5 has now been released on conda-forge for Linux and OSX. |
Another update: zfpy version 0.5.5 is now available as a |
I believe the conda-forge version now also supports Windows. |
It looks like ZFP is tightly integrated with Blosc now: https://www.blosc.org/posts/support-lossy-zfp/ It also supports Caterva-style sharding out of the box. I am gonna try this, but I'm not sure whether a Blosc version with this feature is in Numcodecs. |
Just tried it and it is not registered in the Numcodecs wrapper of Blosc. What can we do to make that available as a part of Blosc in Numcodecs? |
ZFP is already one of the numcodecs compressors. What are you looking for? |
I am mainly concerned about the Caterva functionality of Blosc, regardless of the codec. Any idea about that being used? Regarding ZFP already being a codec, would it not make sense to rely on Blosc's ZFP implementation instead of maintaining it in Numcodecs, now that it can be considered upstream? |
To my knowledge, Blosc does not support zfp's reversible (lossless) mode, whereas numcodecs does. I'd also expect that numcodecs is used by applications other than Zarr that would benefit from zfp support. |
True, they don't support lossless. |
Maybe the right question I should ask is, would it be possible to support caterva within numcodecs? |
We had a discussion about that in zarr-developers/zarr-python#713. Caterva is a lot more than just a codec, and currently it does not integrate with Zarr. Also discussed at length in zarr-developers/zarr-python#877. The basic conclusion I reach is that Caterva sits at the same level of abstraction as Zarr. It's a container, does metadata, wants to manage its own I/O. So you would either use Caterva OR Zarr, but not both. A more suitable integration might be to implement Blosc2 in numcodecs, which would bring many of the benefits of Caterva. |
Let me also add that zfp itself provides similar functionality, i.e., it exposes C and C++ multidimensional array containers that support read and write random access, (de)serialization, iterators and proxy pointers, views, etc. zfp 1.0.0 allows integrating codecs other than zfp via templates. I have no experience with Zarr or Caterva, but my understanding is that they have some overlap with zfp's compressed-array classes. |
Excellent discussion, guys, thank you for the pointers. It sounds like the sharding storage transformer in the Zarr v3 spec will provide Caterva-like functionality: zarr-developers/zarr-python#1111, spec: https://zarr-specs.readthedocs.io/en/latest/extensions/storage-transformers/sharding/v1.0.html ZFP inside the sharded storage transformer will be very useful for our use cases! |
@lindstro I think your characterization is accurate. Zarr takes a modular approach at the python level (separate store object, codec, and array-interface [i.e. zarr-python]). This modularity has served us well (very easy to add a new codec or store implementation) but it prevents us from taking advantages of some of the full-stack optimizations afforded by zfp, blosc2, or caterva because they transcend multiple layers of this stack. It would be interesting to think about how to solve this... 🤔 |
I just learned about a new compression library called ZFP: https://github.com/LLNL/zfp
It would be excellent to add ZFP compression to Zarr! What would be the best path towards this? Could it be added to numcodecs?