Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behaviour crash on NaN/inf/-inf input #242

Open
mjwillson opened this issue Aug 19, 2024 · 5 comments
Open

Undefined behaviour crash on NaN/inf/-inf input #242

mjwillson opened this issue Aug 19, 2024 · 5 comments

Comments

@mjwillson
Copy link

Sorry another UB bug:

I'm aware that zfp doesn't currently support round-tripping non-finite values, but it would be good if it didn't crash hard on NaN/inf/-inf input, in particular since client libraries like xarray / zarr will often pad data with NaNs, sometimes at a fairly low level in the library e.g. to achieve uniform chunk sizes before writing to zarr.

As in #241, with clang I get SIGILL crashes, and with UBSAN enabled it's attributed to, e.g.:

third_party/zfp/src/template/encodef.c:57:17: runtime error: nan is outside the range of representable values of type 'int'
    #0 0x55cf45e3edf7 in zfp_encode_block_float_4 third_party/zfp/src/template/encodef.c:98
    #1 0x55cf45e3f916 in zfp_encode_partial_block_strided_float_4 third_party/zfp/src/template/encode4.c:88:10
    #2 0x55cf45e480bc in compress_strided_float_4 third_party/zfp/src/template/compress.c:105:13
    #3 0x55cf45e464b2 in zfp_compress third_party/zfp/src/zfp.c:1116:3
...
SUMMARY: UndefinedBehaviorSanitizer: float-cast-overflow third_party/zfp/src/template/encodef.c:57:17

Assuming we want to allow some default conversion to happen silently, this can be worked around via -fno-strict-float-cast-overflow -fno-sanitize=float-cast-overflow, but I'm reluctant to enable these too widely in case it masks some other bug.

@mjwillson
Copy link
Author

In fact in reversible mode at least, with these flags I do find it round-trips NaNs correctly. YMMV and obviously wouldn't work in lossy mode.

@lindstro
Copy link
Member

It is of course undesirable for zfp to crash if fed NaNs or infinities, which I suppose happens in the conversion from floating point to integer in the block-floating-point transform, which causes UB. On the other hand, what would be an acceptable solution here? zfp cannot do anything meaningful with such data (except in reversible mode, as you correctly point out), and so could only at best return an error without actually representing the data. At this point, the application basically has to terminate as this results in complete loss of data. Of course, doing so without crashing is preferable.

However, to detect non-numbers, one would have to insert potentially expensive infinite() calls/macros. Since there are no such calls in C89, you have to rely on constructs like x != x to detect NaN and x > DBL_MAX || -x > DBL_MAX to detect infinities. Doing so for every value to be compressed seems expensive for those applications that use zfp for arithmetic--you wouldn't instrument your numerical application to check every intermediate floating-point value for NaN or infinity. This would lower performance for all applications, including those that can guarantee that all data is numerical.

When zfp is used for offline storage (as in xarray and zarr), a reasonable solution would be for the application code to perform such checks before calling zfp. I realize that such libraries may have to be modified deep down to perform such checks, but they still need to deal with the fact that zfp would fail (ideally without crashing) in this case.

In the case of zarr, there's a layer around zfp (in numcodecs) where, IMO, such testing (e.g., via numpy.isfinite() and numpy.any()) ought to be done so that zfp is only fed numerical data. This would insulate such zfp-specific tests (that likely apply to other compressors as well) from the core zarr library.

I'm unaware of xarray usage of zfp and so can't comment on that particular case.

All this being said, we have thought of several ways of supporting (encoding) non-numbers, primarily for offline storage, where performance is not as critical. Likely the solution would be to set a compile-time flag or a runtime flag (in zfp_stream), or to expand the API to enable such a capability so as not to bog down the primary use case of compressing numbers. Regardless of how this is achieved, it would almost surely imply backwards incompatible changes to the underlying zfp code stream, and therefore won't be supported anytime soon.

Sorry for not being more helpful here, but I think having the application or I/O layer check the data for validity is a reasonable compromise. Another might be to add a compile-time macro (disabled by default) to zfp to request that it check inputs and return zero upon failure. Keep in mind that this would result in zfp_compress() returning zero (failure) if even a single value in a gigabyte-sized array is non-numerical, which may or may not be desirable.

@mjwillson
Copy link
Author

mjwillson commented Aug 20, 2024

Thanks again for the considered response.

A hard crash isn't nice, but I think it would be reasonable to always raise a Python exception from zfpy on NaN/inf/-inf input.

Even better would be an option to silently ignore NaN/inf/-inf, if you don't care what these values round-trip as and don't want the overhead of checking for and replacing them before calling zfp -- although not strictly essential.

I'm not an expert on this but I believe on CPU you can trap some of these failure conditions at a fairly low level via signal handlers without having to check each value, and I think this is what numpy does with e.g. np.seterr(invalid='raise'), this allows it to then raise a Python exception.

IMO numpy's approach is quite nice in giving the user a choice: do you want to silently ignore and accept some possibly-implementation-dependent default value, raise an exception, log a warning but proceed, ...

You can see this play out by trying e.g. np.float32('nan').astype(np.int32) under various np.seterr settings.

@lindstro
Copy link
Member

Thanks for the suggestion. From IEEE 754-2019 Section 5.8 (Details of conversions from floating-point to integer formats), it seems clear enough that an exception would be generated:

When a NaN or infinite operand cannot be represented in the destination format and this cannot otherwise
be indicated, the invalid operation exception shall be signaled. When a numeric operand would convert to
an integer outside the range of the destination format, the invalid operation exception shall be signaled if
this situation cannot otherwise be indicated.

Though it's unclear if numpy.seterr() would apply as the code being executed is written in C and called from Cython. While I can confirm using AppleClang that the floating-point status flags are set on integer overflow in this conversion, it does not appear that SIGFPE is signaled (at least the signal handler is never invoked, either on overflow or division by zero). As far as I know, that would be the only way of detecting such exceptions in C without testing the flags. But even if such a signal were generated, would we expect it to "propagate" back to Python where it can be caught as an exception?

Additionally, the invalid operation exception would be raised only when the actual conversion takes place, which invokes UB. From C99 6.3.1.4:

If the value of the integral part cannot be represented by the integer type, the behavior is undefined.

Would it not be "too late" at that point to catch the exception, assuming one is thrown? Probably not in practice, but preferably UB would be avoided altogether.

@mjwillson
Copy link
Author

mjwillson commented Aug 21, 2024

Hey, yeah sorry I didn't mean that np.seterr would catch this for you, just that it's an example of a python library with functionality that does avoid crashing on similar errors in its own code.

Looks like I was wrong and numpy is not trapping SIGFPE, but testing CPU flags. Again not an expert but I'm assuming these flags persist until cleared, so numpy can check only after the entire operation has completed?
https://github.com/numpy/numpy/blob/83cd89b835dca7d5ea8b325e1fecec124316fd12/numpy/_core/src/npymath/ieee754.c.src#L357
If and how it avoids UB I'm not sure.

I can see this is a tricky one anyway, and not as urgent as the other bug, so no worries if there's not an easy fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants