
ZFP Compression #117

Open
rabernat opened this issue Oct 15, 2018 · 62 comments

Labels
enhancement · help wanted · New codec (suggestion for a new codec)

Comments

@rabernat
Contributor

I just learned about a new compression library called ZFP: https://github.com/LLNL/zfp

zfp is an open source C/C++ library for compressed numerical arrays that support high throughput read and write random access. zfp also supports streaming compression of integer and floating-point data, e.g., for applications that read and write large data sets to and from disk.

zfp was developed at Lawrence Livermore National Laboratory and is loosely based on the algorithm described in the following paper:

Peter Lindstrom
"Fixed-Rate Compressed Floating-Point Arrays"
IEEE Transactions on Visualization and Computer Graphics
20(12):2674-2683, December 2014
doi:10.1109/TVCG.2014.2346458

zfp was originally designed for floating-point arrays only, but has been extended to also support integer data, and could for instance be used to compress images and quantized volumetric data. To achieve high compression ratios, zfp uses lossy but optionally error-bounded compression. Although bit-for-bit lossless compression of floating-point data is not always possible, zfp is usually accurate to within machine epsilon in near-lossless mode.

zfp works best for 2D and 3D arrays that exhibit spatial correlation, such as continuous fields from physics simulations, images, regularly sampled terrain surfaces, etc. Although zfp also provides a 1D array class that can be used for 1D signals such as audio, or even unstructured floating-point streams, the compression scheme has not been well optimized for this use case, and rate and quality may not be competitive with floating-point compressors designed specifically for 1D streams.

zfp is freely available as open source under a BSD license, as outlined in the file 'LICENSE'. For more information on zfp and comparisons with other compressors, please see the zfp website. For questions, comments, requests, and bug reports, please contact Peter Lindstrom.

It would be excellent to add ZFP compression to Zarr! What would be the best path towards this? Could it be added to numcodecs?

@jhamman
Member

jhamman commented Oct 15, 2018

It looks like there are already some python/numpy bindings: https://github.com/seung-lab/fpzip

@alimanfoo
Member

alimanfoo commented Oct 15, 2018 via email

@rabernat
Contributor Author

Digging deeper, it looks like fpzip does not support missing data: https://zfp.readthedocs.io/en/release0.5.4/directions.html#directions

This is a dealbreaker for now (at least for me), but it looks like it might be addressed soon.

@jakirkham
Member

How do you handle missing data with Zarr currently?

@william-silversmith

william-silversmith commented Oct 19, 2018

@rabernat I'm the author of the fpzip python bindings. I think you may be conflating zfp and fpzip. fpzip is a lossless codec while zfp isn't. Try the following:

import fpzip
import numpy as np

# a NaN survives the round trip, since fpzip is lossless
x = np.array([[[ np.nan ]]], dtype=np.float32)
y = fpzip.compress(x)
z = fpzip.decompress(y)
print(z)

Worth noting that fpzip is for 3D and 4D data, though 1D and 2D data compression is supported by adding dimensions of size 1.
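
For example, a sketch of that padding trick (the array here is made up for illustration):

import fpzip
import numpy as np

# fpzip expects 3D/4D input; give a 2D field a size-1 leading axis
field = np.random.rand(720, 1440).astype(np.float32)
compressed = fpzip.compress(field[np.newaxis, :, :])
restored = np.squeeze(fpzip.decompress(compressed))  # drop padded axes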

@william-silversmith

william-silversmith commented Oct 19, 2018

Also, if you guys find fpzip interesting, I'd be happy to add more flexible support for Fortran vs. C order. I don't think it's a hard constraint, it's just what I got working for my use case.

UPDATE: fpzip 1.1.0 respects C order arrays.

@rabernat
Contributor Author

rabernat commented Oct 21, 2018

I think you may be conflating zfp and fpzip. fpzip is a lossless codec while zfp isn't.

@william-silversmith, thanks for clearing this up. I'm going to blame the mix-up on @jhamman with this comment.

Thanks also for writing fpzip and sharing it with the community!

I did some benchmarking of fpzip vs. zstd on real ocean data. You can find my full analysis in this notebook. The results are summarized by these two figures, which compare the compression ratio vs. encoding / decoding time of the two codecs on two different ocean datasets, one with a land mask (i.e. NaNs) over about 43% of the domain and one without.

encoding:
[figure: compression ratio vs. encoding time, fpzip vs. zstd]

decoding:
[figure: compression ratio vs. decoding time, fpzip vs. zstd]

It's interesting to note how zstd finds the missing data (encoded as NaNs) immediately; the compression ratios on the data with the mask are much higher. Since fpzip doesn't allow NaNs, I just filled with zeros. With fpzip, there are only very minor compression ratio differences between the masked and the unmasked arrays.

Based on this analysis, in terms of decoding time (which is my main interest), fpzip is nowhere close to zstd. To speed up fpzip's encoding or decoding, I have to go out to precisions of < 20, which results in unacceptable losses:

[figure: fpzip results at precision < 20]

The caveat is that this is all done on my laptop, so it might not be very robust or representative. Please have a look at my notebook and see if I have done anything obviously wrong. (It should be fully runnable from anywhere.)

@rabernat
Contributor Author

Also, in case this was not clear, I am basically not getting any compression of either dataset with fpzip using the default precision:

>>> len(fpzip.compress(data_sst))/data_sst.nbytes
1.0000054012345678
>>> len(fpzip.compress(data_ssh))/data_ssh.nbytes
1.0000054012345678

Am I doing something wrong?

@william-silversmith

william-silversmith commented Oct 22, 2018

Hi Ryan!

Thanks for the info. I'm glad you found it useful to evaluate fpzip! I think Dr. Lindstrom, whom I've corresponded with, would be happy that more people are looking at it.

I have a few comments that I'll just put into bullets:

  • fpzip does support NaNs, so you can use your land mask as-is. It's zfp that doesn't support missing values.
  • I think fpzip's efficacy depends a lot on the dataset. You might very well be justified in using zstd on your dataset. For my lab's 4D data (X, Y, Z, channel) generated during boundary detection on a 3D biomedical image, we saw gzip compress to 65%, zstd to 59%, and fpzip to 46% losslessly. My colleague, Nico Kemnitz, found that with some manipulations that cause a loss of one machine epsilon, we can compress to 56%, 50%, and 32% respectively on our benchmark. You can read more here: https://github.com/seung-lab/cloud-volume/wiki/Advanced-Topic:-fpzip-and-kempressed-Encodings
  • You're correct that fpzip's decompression time is greater than zstd's. I find that it often has near-symmetric compression and decompression rates. It tends to win by a lot on compression and lose by a bit on decompression compared with gzip and zstd.
  • The use case my lab is concerned with is that we are generating (without compression) 5 PB of floating point data. So we were less concerned with decode time than simply getting things crunched down.

I did a small test of some of the pangeo data using the following script, and I think I must have injected a bug. I think the reason there's so little compression is because the tail of the compressed data are all zeros. Let me investigate this... Apologies, this Python library is pretty new and we haven't put it into production use yet, so it's not 100% battle-tested.

import fpzip
import numpy as np
import gcsfs
import pandas as pd
import xarray as xr

gcs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon')
# ds_ssh = xr.open_zarr(gcsfs.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt',
#                                gcs=gcs))

# pull a small SST subset from the Pangeo LLC4320 dataset
ds_llc_sst = xr.open_zarr(gcsfs.GCSMap('pangeo-data/llc4320_surface/SST',
                               gcs=gcs), auto_chunk=False)
data = ds_llc_sst.SST[:5, 1, 800:800+720, 1350:1350+1440].values

x = fpzip.compress(data)
print(x)  # raw compressed bytes

@william-silversmith

william-silversmith commented Oct 22, 2018

I just updated fpzip to version 1.1.1 that should have the trailing zeros problem fixed. Give it a try again!

EDIT: I did a quick test and I saw the following results for the above script:

20736000 bytes raw 
13081082 bytes fpzip (63%)
17396243 bytes gzip -9 (84%)

@lindstro

lindstro commented Nov 6, 2018

@rabernat, I'm the developer of fpzip (and zfp). Having worked a lot with climate scientists, I would certainly expect high-resolution ocean data like yours to compress quite well (definitely to less than 100%!). @william-silversmith, who has done a great job adding a Python interface to fpzip, reports a reduction to 63% after his latest fix, which seems more plausible. Perhaps you can give fpzip another try?

If you're willing to accept some loss within an acceptable error tolerance, then I would suggest checking out zfp. It's a far faster and more advanced compressor than fpzip.

@rabernat
Contributor Author

rabernat commented Nov 7, 2018

Hi @lindstro -- thanks for dropping in here. And most of all, thanks for your important work on compression! As a computational oceanographer, I recognize that this is really crucial for the future sustainability of high-resolution modeling.

I updated my analysis with the latest version of fpzip. Indeed it actually compresses the data now! 😉

Here is a figure which summarizes the important results. In contrast to my previous figure, I have eliminated the precision argument for fpzip and am using it as intended: lossless mode (directly comparable to zstd).

[figure: lossless fpzip vs. zstd, compression ratio and encode/decode throughput]

In terms of compression ratio, fpzip beats zstd when there is no land mask (holds regardless of the zstd level). With a land mask, zstd is able to do a bit better. zstd is a lot faster, particularly on decoding, but that is not a dealbreaker for us. (50 MB/s is still fast compared to network transfer times.)

Based on this analysis, I think we definitely want to add fpzip support to numcodecs.

If you're willing to accept some loss within an acceptable error tolerance, then I would suggest checking out zfp. It's a far faster and more advanced compressor than fpzip.

Yes we would love to try zfp. Is there a python wrapper for it yet?

@rabernat
Contributor Author

rabernat commented Nov 7, 2018

Another point about fpzip: compression is considerably better (a ratio of 0.4 vs. 0.55) if I transpose the arrays from their native Python / C order to Fortran order when I feed them in. @william-silversmith -- I notice you have added the order keyword to fpzip.decompress but not to fpzip.compress. Would it be possible to add support for compression of C-order arrays? This would allow us to avoid the manual transpose step at the numcodecs level.
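
For reference, the manual step I'm describing is something like this (a sketch with made-up data; the physical transpose is exactly the copy we'd like to avoid):

import fpzip
import numpy as np

# transpose the C-order array so x varies fastest in fpzip's terms,
# paying for an extra copy before compression
data = np.random.rand(5, 720, 1440).astype(np.float32)  # made-up field
compressed = fpzip.compress(np.ascontiguousarray(data.transpose()))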

@rabernat
Contributor Author

rabernat commented Nov 7, 2018

@alimanfoo - presuming we want to move forward with adding fpzip and zfp to numcodecs, what would be the best path? Would you want to use @william-silversmith's python package, or would you want to re-wrap the C code within numcodecs, as currently done for c-blosc? The latter approach would also allow us to use zfp without an independent python implementation. But it is a lot more work. Certainly not something I can volunteer for.

@alimanfoo
Member

alimanfoo commented Nov 7, 2018 via email

@lindstro

lindstro commented Nov 7, 2018

@rabernat, thanks for rerunning your compression study. fpzip was developed back in 2005-2006, when 5 MB/s was fast. I've been meaning to rewrite and parallelize it, but I've got my hands full with zfp development.

Regarding transposition, clearly we'd want to avoid making a copy. fpzip and zfp both use the convention that x varies faster than y, which varies faster than z. So an array of size nx * ny * nz should have a C layout

float array[nz][ny][nx];

@william-silversmith's Python wrappers should ideally do nothing but pass the array dimensions to fpzip in the "correct" order and not physically move any data. zfp (but not fpzip) also supports strided access to handle arrays of structs without having to make a copy.
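
In NumPy terms, that convention amounts to reporting the shape reversed rather than moving any data (a small sketch):

import numpy as np

# for a C-order array of shape (nz, ny, nx), the last axis (x) varies
# fastest, so a binding should pass dimensions to fpzip/zfp as
# (nx, ny, nz) and hand over the underlying buffer unchanged
a = np.empty((8, 16, 32), dtype=np.float32)  # (nz, ny, nx)
nx, ny, nz = a.shape[::-1]                   # dimensions in fpzip's order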

As far as Python wrappers for zfp, we're just now starting on that and have an end-of-the-year milestone to deliver such a capability. The developer working on this has suggested using cffi. As I'm virtually Python illiterate, I'm not sure how that would interact with numcodecs. Any suggestions are welcome.

@alimanfoo
Member

alimanfoo commented Nov 7, 2018 via email

@rabernat
Contributor Author

rabernat commented Nov 7, 2018

Ok, so we are clearly in a "help wanted" situation.

Implementing the necessary wrappers / documentation / tests for fpzip and zfp in numcodecs would make an excellent student / intern project. It is a clearly defined task with plenty of examples to draw from. And it will have a big impact by bringing these codecs to a broader community via zarr. So if anyone knows of any students or other suitable candidates looking to get their feet wet in open source, please send them to this issue!

Also, I'm transferring this to the numcodecs issue tracker, where it clearly belongs.

@rabernat transferred this issue from zarr-developers/zarr-python Nov 7, 2018
@alimanfoo
Member

Just to elaborate a little, what we would like to do is implement a codec class for ZFP. The Codec interface is defined here. Once there is a Python package binding ZFP, then the codec implementation is very simple, basically the Codec.encode(buf) method implementation would just pass through to a zfp.compress(buf) function, and similarly the Codec.decode(buf, out=None) method would ideally just pass through to a zfp.decompress(buf, out) function.

There is a detail on the decode method in that numcodecs supports an out argument which can be used by compressors that have the ability to decompress directly into an existing buffer. This potentially means that decompressing involves zero memory copies. So if a zfp package offered this ability to decompress directly into a buffer exposed by a Python object via the buffer interface, this could add an additional optimisation. However, most compression libraries don't offer this, so it is not essential. I.e., if zfp does not offer this, then in numcodecs if an out argument is provided, we just do a memory copy into out. For an example, see the zlib codec implementation.

One other detail, ideally a zfp Python binding would be able to accept any object exposing the Python buffer interface. We currently also require codecs to be able to handle array.array which in Python 2 needs a special case because it doesn't implement the buffer interface. But we can work around that inside numcodecs. I.e., it is fine if a zfp binding just uses the buffer interface, no special case needed for array.array in Python 2. E.g., this is part of the reason why there is some special case code for Python 2 within the zlib codec implementation.
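
To make that concrete, here is a minimal sketch of what such a codec could look like. The zfpy module and its compress_numpy / decompress_numpy functions are placeholders for whatever binding emerges; ndarray_copy is the numcodecs helper other codecs use to honour out:

import zfpy  # placeholder for a future zfp binding
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy

class ZFP(Codec):
    """Sketch of a ZFP codec following the pattern described above."""

    codec_id = 'zfp'

    def __init__(self, tolerance=-1):
        self.tolerance = tolerance

    def encode(self, buf):
        # one-shot compression of the whole chunk
        return zfpy.compress_numpy(buf, tolerance=self.tolerance)

    def decode(self, buf, out=None):
        # the binding returns a fresh array; honour `out` with a copy
        dec = zfpy.decompress_numpy(bytes(buf))
        if out is not None:
            return ndarray_copy(dec, out)
        return dec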

Hth.

@jakirkham
Member

We currently also require codecs to be able to handle array.array which in Python 2 needs a special case because it doesn't implement the buffer interface.

We can smooth over this a bit. Opened PR ( #119 ) as a proof-of-concept.

@lindstro

@alimanfoo, zfp does indeed decompress directly into a user-allocated buffer. Similarly, compression writes to a user-allocated buffer whose size zfp conservatively estimates for the user. Alternatively, the user can specify how much memory to use for the compressed data, and zfp will ensure that the compressed data fits within the buffer (with quality commensurate with buffer size).

I know very little about zarr, but my understanding is that it partitions arrays into equal-shape chunks whose compressed storage is of variable length. I'm guessing chunks have to be large enough to amortize the overhead of compression metadata (I see 1 MB or so recommended). zfp provides similar functionality but uses very small chunks (4^d values for d-dimensional arrays) that are each (lossily) compressed to a fixed number of bits per chunk (the user specifies how many bits; for 3D arrays, 1024 bits is common). I wonder if this capability could be exploited in zarr without having to rely on zfp's streaming compression interface.
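
For a sense of scale, the fixed-rate arithmetic is simple (plain math, not a zfp API call):

# fixed-rate sizing for a 3D array: each 4x4x4 block of 64 values is
# stored in exactly `rate` bits per value
rate = 16                     # user-chosen bits per value
bits_per_block = rate * 4**3  # 1024 bits, the common 3D figure above
print(bits_per_block // 8)    # -> 128 bytes per compressed block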

@jakirkham
Member

We've revamped how codecs handle data internally. Outlined in this comment. This should make it much easier to contribute new codecs. Would be great if those interested took a look and provided any feedback.

@alimanfoo
Member

@lindstro thanks for the information, very helpful.

Apologies if I have not fully understood the details in your comment, but you are right to say that zarr partitions an array into equal-shape chunks, and then passes the data for each chunk to a compression codec (in fact this can be a pipeline of codecs configured by the user, but typically it is just a compressor). The result of encoding the chunk is then stored, and this will be of variable length depending on the data in each chunk.

In the zarr architecture, ZFP would be wrapped as a codec, which means it could be used as the compressor for an array. So zarr would pass each chunk in the array to ZFP, and then store whatever ZFP gives it back. Zarr passes all of the data for a chunk to a codec in a single API call, so in general there is no need to use streaming compression, you can just do a one-shot encoding. Typically I find that chunks should be at a minimum 1 MB uncompressed size, usually I find upwards of 16 MB is better, depending somewhat on various factors like compression ratio and type of storage being used.

A compressor (or any other type of codec) is a black box as far as zarr is concerned. If a compressor like ZFP chooses to break the chunk down further into smaller pieces, that is an internal implementation detail. E.g., the Blosc compressor does something similar, it breaks down whatever it is given into smaller blocks, so it can then use multiple threads and compress blocks in parallel.

If it is possible to vary the size of chunks that ZFP is using internally, then this is an option you'd probably want to expose to the user when they instantiate the ZFP codec, so they could tune ZFP for a particular dataset.
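
As a sketch of how that looks from the user's side (assuming a hypothetical ZFP codec class like the one outlined earlier):

import numpy as np
import zarr

# each (180, 360) chunk is handed whole to the codec's encode()
z = zarr.open('example.zarr', mode='w', shape=(720, 1440),
              chunks=(180, 360), dtype='f4',
              compressor=ZFP(tolerance=1e-4))  # hypothetical codec
z[:] = np.random.rand(720, 1440).astype('f4')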

Hth.

@lindstro

@alimanfoo, let me try to clarify.

Whereas zarr prefers uncompressed chunks as large as 16 MB, zfp in its fixed-rate mode uses compressed chunks on the order of 8-128 bytes (a cache line or two), which provides for far finer granularity of access. Think of zfp as a compressed floating-point format for SSE vectors and similar. I was just thinking out loud whether a capability like that could be exploited by zarr, for example, if traversing a 3D array a 2D or 1D slice at a time, when only a very small subset of a 16 MB chunk is needed.

My reference to zfp streaming compression was meant in the sense of sequentially (de)compressing an entire array (or large portions thereof) in contrast to zfp's inline (de)compression capability, where tiny blocks (4^d scalars in d dimensions) are (de)compressed on demand in response to random access reads or writes to individual array elements.

Of course, one could use zfp as a black box codec with zarr to (de)compress large chunks, but my question had to do with whether zarr could benefit from fine-grained access to individual scalars or a few handfuls of scalars, as provided by zfp's own compressed arrays.

If not, then the best approach to adding zfp support to zarr would most likely be to use the Python bindings we're currently developing to zfp's high-level C interface, which is designed for infrequent (de)compression of the megabyte-sized chunks that zarr prefers.

@alimanfoo
Member

Many thanks @lindstro, very nice clarification.

Of course, one could use zfp as a black box codec with zarr to (de)compress large chunks, but my question had to do with whether zarr could benefit from fine-grained access to individual scalars or a few handfuls of scalars, as provided by zfp's own compressed arrays.

We had a similar discussion a while back, as the blosc compressor offers something analogous, which is the ability to decompress specific items within a compressed buffer, via a function called blosc_getitem. The discussion is here: zarr-developers/zarr-python#40, see in particular comments from here: zarr-developers/zarr-python#40 (comment).

The bottom line is that I think it could, in principle, be possible to modify numcodecs and zarr to leverage this kind of feature. However, it would require some reworking of the API layer between zarr and numcodecs and some non-trivial implementation work. I don't have bandwidth to do work in that direction at the moment, but if someone had the time and motivation then they'd be welcome AFAIC.

As I understand it, the type of use case that would make this feature valuable is one where you are doing a lot of random-access operations into small regions of an array, i.e., pulling out individual values or small clusters of nearby values. I don't have any use cases like that personally, but maybe others do, and if so I'd be interested to know.

All my use cases involve running parallel computations over large regions of an array. For those use cases, it is fine to decompress entire chunks, as it is generally only a few chunks around the edge of an array or region where you don't need to decompress everything, and the overhead of doing a little extra decompression than strictly needed is negligible. Occasionally I do need to pull out single values or a small sub-region, but that is generally a one-off, so again the extra overhead of decompressing entire chunks is easy to live with.

If not, then the best approach to adding zfp support to zarr would most likely be to use the Python bindings we're currently developing to zfp's high-level C interface, which is designed for infrequent (de)compression of the megabyte-sized chunks that zarr prefers.

This sounds like a good place to start.

@lindstro

lindstro commented Dec 3, 2018

Regarding use cases, you'll have to excuse my ignorance of how zarr works and the use cases it was designed for, but here are a few that motivated the design of zfp and its small blocks:

  • zfp serves as a substitute for conventional multidimensional random-accessible arrays. zfp generally avoids the need to rewrite existing application code to traverse the arrays in an order best suited for the underlying data structure (e.g., one chunk at a time). Ideally, you substitute only C/C++ array or STL vector declarations with zfp array declarations while leaving the remaining code intact. Sometimes the traversal order can be hidden behind iterators, but it is common in C and even C++ to use explicit indexing and nested for loops for array computations, especially in stencil-based computations that require localized random access. Because zfp's blocks are so small, it does not matter much in what order the (multidimensional) array is accessed, whereas if you use large chunks on the order of 16 MB, you'll want to process all elements in a chunk before moving on to the next one. That is, if the traversal enters and exits a chunk many times, then you don't want to (de)compress the chunk each time, and you may not be able to afford caching many large uncompressed chunks. (Does zarr cache decompressed chunks?)

  • If chunks are much larger than the hardware cache size, then performance will suffer when making repeated accesses to a chunk, e.g., via kernel fusion, stencil operations, gathers and scatters between staggered grids, etc. The compressed chunk size in zfp is usually on the order of one L1 cache line, while a decompressed chunk is a few cache lines, allowing both compressed and uncompressed data to fit in L1 cache if the access pattern exhibits good locality. If chunks are as large as 16 MB, then the decompressed data has already been evicted from L1 and L2 cache by the time computation on it executes.

  • Some applications traverse subsets of arrays. Examples include 1D and 2D slices of a 3D array, data-dependent isocontours and integral paths, boundary layers for ghost data exchange in distributed settings, regions of interest and range queries (e.g., in visualization), subsampling (e.g., for data decimation), etc. If chunks are too coarse, then there's a lot of overhead associated with decompressing data that is not accessed.

  • zfp supports both read and write access. Only (uncompressed and cached) blocks that are modified need to be written back to compressed storage when evicted from zfp's software cache. If chunks are large, then a change of a single value in a chunk would trigger a lot of data to be compressed.

I do agree, however, that it is often possible to structure the traversal and computation in a streaming manner to amortize the latency associated with (de)compression. This, however, typically requires refactoring existing code.

@alimanfoo
Member

Thanks @lindstro, very interesting.

FWIW Zarr was designed primarily as a storage partner to the Dask array module, which implements a subset of the Numpy interface but as chunked parallel computations. It's also designed to work well on a single machine (multi-threaded or multi-process) or on distributed clusters (e.g., Pangeo-like clusters reading from object storage). With Dask, the sweet spot tends to be towards larger chunks anyway, because there is some scheduling overhead associated with each task. So there has not (yet) been any pressure to optimise access to smaller array regions.

But as I mentioned above it is conceivable that Zarr could be modified to leverage capabilities of ZFP or Blosc to decompress chunk sub-regions. So if someone has a compelling use case and wants to explore this then please feel free.

@jakirkham
Member

@lindstro, one important point is that Zarr supports on-"disk" storage (which could be an object store or actual files in directories) as well as in-memory storage. It's possible smaller chunk sizes could make sense for some in-memory Zarr Array use cases. The in-memory format is a dictionary with each chunk stored under its own key-value pair. Fortunately, whatever work goes into adding a codec to Numcodecs can easily be leveraged by either use case.

@alimanfoo
Member

@rabernat, I agree that chunks serve an entirely different purpose in zfp and in zarr. As I mentioned to @alimanfoo, it might make sense to exploit the fact that each zarr chunk is many independent zfp blocks, and to decompress only those zfp blocks needed when accessing a subset of the chunk. But I imagine supporting this would likely require major surgery on zarr.

Let's just say it would be an interesting challenge 😄. If anyone does decide they want to explore this at some point, I'd recommend having a specific, concrete use case in mind to target and benchmark, including specific computations, specific data, and a specific target platform, including what type of storage, what type of computational nodes, and what (if any) type of parallelism. I'd also recommend making sure it's an important use case, i.e., optimisation would save somebody a significant amount of time and/or money; and doing enough thinking ahead of time to be fairly convinced that the desired performance improvement should in theory be achievable. These suggestions are obvious to everyone I'm sure, but just wouldn't want anyone to sink loads of time for little reward.

@halehawk
Contributor

halehawk commented Dec 8, 2018 via email

@jakirkham
Member

jakirkham commented Dec 8, 2018

Normally those sorts of things go in the constructor and are set as members of the instance. Take a look at Zlib for a simple example.

@rabernat
Contributor Author

@halehawk - a great way to move forward with this would be to just submit a pull request with your code as is. Then we can discuss the details and iterate before merging.

@halehawk
Contributor

halehawk commented Dec 22, 2018 via email

@rabernat
Contributor Author

Ah fantastic!

Perhaps you could edit your PR description to include the text "Closes #117"? That way it will be linked to this discussion.

@halehawk
Contributor

halehawk commented Dec 22, 2018 via email

@halehawk
Contributor

halehawk commented Dec 22, 2018 via email

@lindstro

lindstro commented May 7, 2019

zfp 0.5.5 has been released, which adds Python bindings to the C API and the ability to (de)compress NumPy arrays. The current release supports compression of arbitrarily strided arrays (including non-contiguous arrays), but does not yet support decompression with arbitrary strides. This release also adds lossless compression to zfp.

We are planning on expanding zfp's Python API over the next year or so, for instance, to include Python bindings to zfp's compressed arrays.
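
A minimal round trip with the new bindings looks like this (assuming the zfpy API as documented for 0.5.5, where omitting all mode arguments selects the reversible mode):

import numpy as np
import zfpy

arr = np.random.rand(64, 64, 64)

buf = zfpy.compress_numpy(arr)  # no rate/precision/tolerance: lossless
assert np.array_equal(zfpy.decompress_numpy(buf), arr)

lossy = zfpy.compress_numpy(arr, tolerance=1e-4)  # error-bounded, lossy
approx = zfpy.decompress_numpy(lossy)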

@kmpaul

kmpaul commented Oct 29, 2019

Update: zfpy version 0.5.5 has now been released on conda-forge for Linux and OSX.

@lindstro

Another update: zfpy version 0.5.5 is now available as a pip package for Linux, macOS, and Windows.

@kmpaul

kmpaul commented Jun 1, 2020

I believe the conda-forge version now also supports Windows.

@tasansal

tasansal commented Sep 4, 2022

It looks like ZFP is tightly integrated with Blosc now:

https://www.blosc.org/posts/support-lossy-zfp/

It also supports Caterva-style sharding out of the box.

I am gonna try this, but I'm not sure if a Blosc version with this is in Numcodecs.

@tasansal

tasansal commented Sep 4, 2022

Just tried it, and it is not registered in the Numcodecs wrapper of Blosc. What can we do to make that available as part of Blosc in Numcodecs?

@halehawk
Contributor

halehawk commented Sep 6, 2022 via email

@tasansal

tasansal commented Sep 6, 2022

I am mainly concerned with the Caterva functionality of Blosc, regardless of the codec. Any idea about whether that could be used?

Regarding ZFP being a codec already, would it not make sense to rely on Blosc's ZFP implementation instead of maintaining it in Numcodecs now that it can be considered upstream?

@lindstro

lindstro commented Sep 6, 2022

To my knowledge, Blosc does not support zfp's reversible (lossless) mode, whereas numcodecs does. I'd also expect that numcodecs is used by applications other than Zarr that would benefit from zfp support.

@tasansal

tasansal commented Sep 6, 2022

To my knowledge, Blosc does not support zfp's reversible (lossless) mode, whereas numcodecs does. I'd also expect that numcodecs is used by applications other than Zarr that would benefit from zfp support.

True, they don't support lossless.

@tasansal

tasansal commented Sep 6, 2022

Maybe the right question I should ask is: would it be possible to support Caterva within numcodecs?

@rabernat
Contributor Author

rabernat commented Sep 6, 2022

We had a discussion about that in zarr-developers/zarr-python#713. Caterva is a lot more than just a codec, and currently it does not integrate with Zarr. Also discussed at length in zarr-developers/zarr-python#877. The basic conclusion I reached is that Caterva sits at the same level of abstraction as Zarr: it's a container, it handles metadata, and it wants to manage its own I/O. So you would use either Caterva OR Zarr, but not both.

A more suitable integration might be to implement Blosc2 in numcodecs, which would bring many of the benefits of Caterva.

@lindstro

lindstro commented Sep 6, 2022

Let me also add that zfp itself provides similar functionality, i.e., it exposes C and C++ multidimensional array containers that support read and write random access, (de)serialization, iterators and proxy pointers, views, etc. zfp 1.0.0 allows integrating codecs other than zfp via templates. I have no experience with Zarr or Caterva, but my understanding is that they have some overlap with zfp's compressed-array classes.

@tasansal

tasansal commented Sep 6, 2022

Excellent discussion, guys, thank you for the pointers.

Sounds like this in the v3 Zarr spec will provide Caterva-like functionality. zarr-developers/zarr-python#1111

spec: https://zarr-specs.readthedocs.io/en/latest/extensions/storage-transformers/sharding/v1.0.html

ZFP inside the sharded storage transformer will be very useful for our use cases!

@rabernat
Contributor Author

rabernat commented Sep 6, 2022

@lindstro I think your characterization is accurate. Zarr takes a modular approach at the python level (separate store object, codec, and array-interface [i.e. zarr-python]). This modularity has served us well (very easy to add a new codec or store implementation), but it prevents us from taking advantage of some of the full-stack optimizations afforded by zfp, blosc2, or caterva, because they transcend multiple layers of this stack. It would be interesting to think about how to solve this... 🤔

@dstansby added the "New codec" label Aug 11, 2024