Python Bindings: Numpy Support #29
Conversation
Oh. It's probably worth mentioning that the bindings aren't built by default. You have to enable them explicitly at build time.
@SteVwonder This is fantastic! I'm up against a paper deadline this week, but @salasoom and/or I will take a closer look over the next few days.
Can you base the PR on the develop branch? Now that we are finishing bindings in a couple of languages, we may mess around with the directory structure in the next release. For now, please separate the tests into their own directory.

I've been meaning to refactor some of those utility libraries, so your input can help re-shape those APIs to be the least painful to call from Python (e.g., removing double pointers).

As far as benchmarking goes, that's something we've been meaning to get around to: an executable we can run on local machines, because we see a large range of test runtimes on Travis and AppVeyor. For now, we time how long compression and decompression take for those million-element arrays and print the timings as part of the test. At least then we have a history of CI log output with some timings.
@SteVwonder I've had a chance to skim your PR. I like a lot of what I'm seeing. Great job! I also have a few comments and concerns:

- compress_numpy (which is perhaps the most important function) supports fixed-accuracy mode only, while decompress_numpy supports all modes. Although we have to start somewhere, this is a significant limitation. While we support only fixed-rate compression in the CUDA version, there are good technical reasons for this, having to do with lack of knowledge of where in the compressed stream each thread should start. I know we'll be making changes to the Python API before the release, but I wanted to point out that we'll have to change this function to support the other modes. For now, it might be best to rename the function to make it explicit that it only handles the fixed-accuracy mode. On the other hand, only a single line (zfp_stream_set_accuracy) has to change to also support fixed-precision and fixed-rate. Alternatively, we might consider passing in an object that encodes the compression mode, which is essentially what zfp_stream does, and then add three more short functions for setting the compression parameters. I think Markus or I can make this change.
- I don't think we should have a default tolerance. The user should be aware that this is something they need to specify. If they don't know how to set it, then I think it would be safer to use a tolerance of zero.
- Another difference wrt. zfp_compress is that only one field can be compressed at a time. That might be fine for the Python interface, since one can simply make multiple calls and concatenate the compressed byte strings. However, this does cause an incompatibility with files compressed via the C interface, as zfp_compress does not insert a header between each field. This is something worth documenting, at least.
- Does Cython expose internal implementation functions like _init_field in the global namespace? That would violate the ANSI standard, which reserves all public symbols starting with _.
- In class Memory, maybe check the return value of malloc?
- The return value of zfp_decompress is not checked. It will be zero in case of failure. Maybe throw an exception?
- One potential concern: by default, zfp's compressed data is in multiples of 64-bit "words." Are there any concerns that Python might pass an unaligned array of bytes to decompress_numpy? That could cause issues on some platforms.
- Regarding the changes to the LICENSE, LLNL has gotten picky about what can go in this file: see https://software.llnl.gov/about/licenses/. We may have to update the license and place this additional information in the python directory.
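A rough sketch of what the mode-object alternative could look like on the Python side (all names here are hypothetical, included only to illustrate the idea; the setters mirror zfp_stream_set_accuracy and friends):

```python
# Hypothetical sketch of the "compression-mode object" idea; none of
# these names exist in the bindings yet.

class CompressionMode:
    """Encodes which zfp mode to use and its single parameter."""

    def __init__(self):
        self.mode = None
        self.value = None

    def set_accuracy(self, tolerance):   # maps to zfp_stream_set_accuracy
        self.mode, self.value = "accuracy", tolerance
        return self

    def set_precision(self, precision):  # maps to zfp_stream_set_precision
        self.mode, self.value = "precision", precision
        return self

    def set_rate(self, rate):            # maps to zfp_stream_set_rate
        self.mode, self.value = "rate", rate
        return self

# usage: compress_numpy(arr, CompressionMode().set_accuracy(1e-3))
```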
Force-pushed from fdeb157 to 37ba369.
Codecov Report
@@            Coverage Diff             @@
##           develop      #29     +/-   ##
===========================================
- Coverage    93.55%   93.18%   -0.38%
===========================================
  Files           62       64       +2
  Lines         3058     3477     +419
===========================================
+ Hits          2861     3240     +379
- Misses         197      237      +40

Continue to review the full report at Codecov.
@SteVwonder Currently, there's an assumption that the numpy array uses C ordering, which may or may not be true. I think it would be safer to call zfp_set_stride_* in _init_field by translating the ndarray.strides byte strides to zfp's scalar strides. See this discussion regarding this issue: zarr-developers/numcodecs#160.
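For the byte-stride translation, a minimal sketch (the helper name is made up for illustration):

```python
import numpy as np

def scalar_strides(arr):
    # numpy strides are measured in bytes; zfp strides are measured in
    # scalars, so divide by the element size
    return tuple(s // arr.itemsize for s in arr.strides)

a = np.zeros((15, 16, 17), dtype=np.float64)
print(scalar_strides(a))                      # (272, 17, 1) for C order
print(scalar_strides(np.asfortranarray(a)))   # (1, 15, 240)
```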
@lindstro, I looked into the `type` issue. I haven't found a general way to reference fields/variables in C whose names overlap with reserved keywords in Python/Cython. Ultimately, though, I don't think this is much of an issue for zfp. Cython does not require that every field in a struct be listed; it only requires the fields that you need to access directly from Cython code. So we can just omit `type` from the struct declaration.
@SteVwonder I'm fine with that for the current Python implementation. I still think we should take this into consideration and change the C struct member names in the next release to avoid issues like this. With regards to the …
Force-pushed from 9587f29 to b8c9d92.
Just pushed a commit adding support for fixed-rate and fixed-precision modes. The mode is selected by which of the three optional parameters (tolerance, rate, or precision) the caller passes. Hopefully this works for the EOY milestone, and we can come up with a better interface for the March release. Also, the same caveat as before applies: the current Python test cases just verify that nothing catastrophic happens when doing a round-trip compression/decompression using the bindings.
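For reference, the signature this implies might look like the following sketch (parameter names assumed; -1 serves as the "unset" sentinel, as discussed below):

```python
def compress_numpy(arr, tolerance=-1, rate=-1, precision=-1):
    # Exactly one of the three optional parameters should be set;
    # -1 means "unset", which works because -1 is not a valid value
    # for any of them.
    ...
```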
After peeking at the utility libraries, I think I have a few ideas on how to make them more usable from Python. @salasoom, can we meet in person after the holidays to go over these tests in more depth and brainstorm?
👍 Done.
Makes sense. Removed the default tolerance as part of the fixed-rate/precision commit.
That's a good point. I can document that as a limitation of the current numpy-specific interface. Would adding a boolean `write_header` argument to `compress_numpy` address that?
Good ideas. Both are done.
👍 Sounds good to me. Also, I spoke too soon: it does appear that Cython has support for struct members whose names conflict with Python keywords (cython reference). I believe the following should work (will try it out tomorrow).
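The stripped snippet was presumably along these lines, using Cython's C-name strings to expose the C member `type` under a Cython-safe alias (a sketch, not the final declaration):

```cython
cdef extern from "zfp.h":
    ctypedef enum zfp_type:
        zfp_type_none
        zfp_type_int32
        zfp_type_int64
        zfp_type_float
        zfp_type_double

    ctypedef struct zfp_field:
        # the quoted string is the real C name; _type is the alias
        # used from Cython code
        zfp_type _type "type"
```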
That is possible. I have only been able to test against Python 3 while on my Mac. I can test against both py2 and py3 on Linux tomorrow.
@SteVwonder Well done! I have a few comments below.
Thanks--I think this is fine for now. -1 is not valid for any of the three parameters, so that'll work. But I agree that we need a cleaner interface for the release.
OK. I'm still concerned about the memory layout of the numpy array. Did you get a chance to look into using strides? I believe asfortranarray actually permutes the array, which could be expensive and unnecessary. Instead, I would suggest mapping shape[3] to nx, shape[2] to ny, and so on. This corresponds to the default ndarray layout and presumably the most common case. Then set the zfp strides to guard against other layouts. This ensures that in the most common case, we will traverse the numpy array in sequential order, and no deep copies are needed.
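A sketch of that mapping (the helper name is invented for illustration):

```python
import numpy as np

def to_zfp_layout(arr):
    # zfp lists dimensions fastest-varying first (nx, ny, ...), while
    # numpy's shape is slowest-varying first, so reverse the shape and
    # the scalar strides instead of permuting the data itself
    dims = arr.shape[::-1]
    strides = tuple(s // arr.itemsize for s in arr.strides)[::-1]
    return dims, strides

a = np.zeros((17, 16, 15))       # C-ordered: z, y, x
print(to_zfp_layout(a))          # ((15, 16, 17), (1, 15, 240))
```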
Yes, I think that makes sense.
That's fine. I was just concerned about polluting the global namespace with internal functions and doing so in an ANSI non-compliant manner.
Nice.
Sounds good. I noticed that this PR does not pass on Travis. Do you know why?
Force-pushed from d2ba3a8 to ebe63fa.
I have not, but I agree that strides is definitely the cleaner/better way to go in the long term. Unfortunately, I will not be able to implement that before the EOY.
I'm not 100% sure of the root cause, but I think it has to do with either the version of python (v3.4) or Cython (v?) in Ubuntu Trusty. After switching to Ubuntu Xenial, everything passes. I will investigate that further.
Forgot to mention earlier that I made this change. The top-level LICENSE is unchanged, and there are two subdirectories under the `python` directory.
Is there a file currently generated as part of the test suite to test this against? PS: I added a commit that removes references to reserved keywords like `type`.
No worries. Let's tackle that next year. Would be nice to verify that the Python version can decompress a file written by the zfp command-line tool (see more below).
Nice!
Sounds good.
No, we don't include binaries with zfp. But, what I would suggest is that you generate a small rectangular 2D or 3D floating-point array and compress it with the zfp command-line tool (use -h to write a header). A linear sequence of integers will do: f(x, y, z) = x + nx * (y + ny * z). You want to make sure nx != ny != nz, e.g., nx = 15, ny = 16, nz = 17. Then feed the compressed byte stream to your decompressor and verify that it works correctly. You could use -a 0 to achieve near lossless compression. Then repeat with zfPy compressing the file and the command-line tool decompressing.
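A sketch of one half of that round trip, assuming the zfp command-line tool's documented flags (-f for float32, -3 for 3D dimensions, -h to write a header, -a for accuracy mode); double-check against your build's usage text:

```python
import subprocess
import numpy as np

nx, ny, nz = 15, 16, 17
# f(x, y, z) = x + nx * (y + ny * z); with C ordering (z slowest) this
# is simply a linear ramp
a = np.arange(nx * ny * nz, dtype=np.float32).reshape(nz, ny, nx)
a.tofile("input.raw")

# compress with the CLI, writing a header and using near-lossless
# accuracy mode
subprocess.run(
    ["zfp", "-i", "input.raw", "-z", "input.zfp", "-f",
     "-3", str(nx), str(ny), str(nz), "-h", "-a", "0"],
    check=True)

# the test would then feed input.zfp to decompress_numpy and compare
# the result against `a`
```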
Can you please elaborate? Surely zfp_field.type needs to be referenced somewhere? We should be careful to avoid Python keywords in the zfp C implementation. I tried to locate a list of reserved keywords but did not see either …
👍 Sounds like a solid plan. We should be able to write the "compress with zfPy and decompress with CLI" test after adding the `write_header` argument to `compress_numpy`.
Sorry, I was being sloppy with my terms.
You are absolutely correct; it does need to be referenced. In the case of `type`, the struct declaration gives the member a Cython-safe alias, and references to the member then go through that alias rather than the raw C name.
@SteVwonder We now have support for reversible (lossless) compression (see feature/lossless branch). I think a quite natural default would be to use reversible compression when the user does not specify a rate, precision, or tolerance. We'll have to work on integrating these branches, but I wanted to mention this as perhaps a better alternative than throwing an exception when no optional argument is passed.
Following this work with interest. I have a question that I couldn't quite answer from reading the code: can decompression write into a caller-allocated output buffer?
@SteVwonder can correct me if I'm wrong, but the current implementation allocates a buffer (ndarray). The reason for this is that the caller of decompress_numpy does not necessarily know how large a buffer to allocate, since the header information that stores array dimensions and scalar type is part of the compressed stream. If this metadata were available elsewhere, then I believe there's no technical reason why we could not supplement the API with a function that also accepts the buffer to decompress into. We'd want to ensure that the buffer is large enough to hold the decompressed data. The zfp C implementation requires the caller to allocate the buffer anyway, which is made easier by separating the calls for parsing the header and decompressing the array. @rabernat Can you point us to an example of how such functionality is being used in Zarr/numcodecs?
The Python wrapper for blosc in numcodecs contains an optional output-buffer argument.
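If zfPy grew a similar option, it might look like this sketch (the `out` parameter and the `parse_header` helper are hypothetical, not part of the current bindings):

```python
import numpy as np

def decompress_numpy(stream, out=None):
    shape, ztype = parse_header(stream)   # hypothetical helper
    if out is None:
        out = np.empty(shape, dtype=ztype)
    elif out.nbytes < int(np.prod(shape)) * np.dtype(ztype).itemsize:
        raise ValueError("output buffer too small")
    # ... decompress into `out` ...
    return out
```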
👍
Agreed!
Seems like a reasonable solution. We should be able to do something similar. zfp supports parallel compression using OpenMP and CUDA. Is that something that would be useful in Zarr?
Definitely! I'm a bit concerned that we now have two competing implementations of the python bindings for zfp, the one here and the one from @halehawk in zarr-developers/numcodecs#160.
I think it is fine. I implemented basic compression/decompression of zfp, which can be used immediately. Peter and his team can add parallel and image compression slowly.
Nice to see work on Python bindings for zfp. I would like to test them. How would I do this? As far as I can see, there is no release of this yet. How can I install and use the current development version of these bindings? Thanks in advance :)
includes validation for number of dimensions, zfp_type, zfp_mode, and number of compression parameters
remove call to `asfortranarray` in `compress_numpy` and replace it with calls to `zfp_field_set_stride_*d` when creating the `zfp_field`; tests of floating-point arrays beyond 1D now work
precision and accuracy compression aren't designed for integer types; don't use them in the checksum testing
pull input validation out into separate functions
Per @romankarlstetter's suggestion, make sure docstrings and function signatures are included in the compiled bindings. This greatly improves python binding usage in interactive scenarios, like Jupyter Notebooks.
enables other Python threads to execute while the computationally expensive C functions zfp_compress and zfp_decompress are running (see the Cython sketch after these commit notes)
raise a RuntimeError if the write failed and the return is 0
instead, leverage numpy's itemsize attribute
…ent compilers, plus some 2.7 (osx included); fix python test utils compile error
flipping the strides in the `compress_numpy` function means we have to provide numpy with the reverse of how we actually want the data compressed
add unit tests for this functionality
The associated metadata (ztype and shape) are now required arguments. There is no longer an attempt to read the header, and thus there is no validation of user-provided metadata against header metadata. Remove strides since they aren't tested and aren't fully working.
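Regarding the GIL commit above, a minimal Cython sketch of the pattern (declarations abbreviated; zfp_compress is the real C entry point, while the wrapper name is made up):

```cython
cdef extern from "zfp.h" nogil:
    ctypedef struct zfp_stream:
        pass
    ctypedef struct zfp_field:
        pass
    size_t zfp_compress(zfp_stream* stream, const zfp_field* field)

cdef size_t compress_releasing_gil(zfp_stream* stream, const zfp_field* field):
    cdef size_t n
    # release the GIL around the expensive C call so other Python
    # threads can make progress while zfp works
    with nogil:
        n = zfp_compress(stream, field)
    return n
```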
Force-pushed from 8b5332a to 8d64d04.
@salasoom and @lindstro: just wanted to post this PR to update you on the progress of the python bindings. Support for compression/decompression from/to numpy arrays has been added, as well as a few tests to verify that a "round-trip" of compression/decompression works. @salasoom, if you have some time before the holidays, it would be great to discuss what other tests would be beneficial.
TODO:
- `write_header` argument for `compress_numpy`