
Automate interpretation of _Unsigned attribute #1453

Merged
merged 16 commits on Jul 28, 2017

Conversation

@deeplycloudy (Contributor) commented Jun 14, 2017

@deeplycloudy (Contributor, Author)

In addition to the included (basic) test I've also tested this with the real-world data that motivated the PR and #1444. While it's a working draft, I'd welcome comments on the basic approach and appropriateness of the test coverage.

@shoyer (Member) commented Jun 15, 2017

Instead of putting this alongside the mask_and_scale logic, can you make a separate class to do the dtype fixing in decode_cf_variable? Take a look at BoolTypeArray for an example.
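As a rough illustration of that suggestion, here is a minimal sketch of a lazy unsigned-view wrapper in the spirit of BoolTypeArray. This is not necessarily the class that was merged; the class name, the unsigned_dtype attribute, and the example values are assumptions for illustration only.

```python
import numpy as np


class UnsignedIntTypeArray(object):
    # Illustrative sketch in the spirit of BoolTypeArray: report an unsigned
    # dtype, and reinterpret the signed on-disk values only when indexed.
    def __init__(self, array):
        self.array = np.asarray(array)
        self.unsigned_dtype = np.dtype('u%s' % self.array.dtype.itemsize)

    @property
    def dtype(self):
        return self.unsigned_dtype

    def __getitem__(self, key):
        return self.array[key].astype(self.unsigned_dtype)


# e.g. a signed byte of -1 on disk should decode to 255 as uint8
signed = np.array([-1, 0, 1], dtype='i1')
print(UnsignedIntTypeArray(signed)[:])  # [255 0 1]
```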

@deeplycloudy (Contributor, Author)

I've created a new UnsignedIntTypeArray and have separated the logic from mask_and_scale. Lint warnings have been fixed and the docs updated.

@deeplycloudy (Contributor, Author)

The CI failure is for 2.7/cdat/pynio in a couple of my new tests. In one, the fill value is not being applied; in the other, the unsigned conversion isn't happening. Are there any known differences in that cdat/pynio stack that would cause these to fail while others pass?

@deeplycloudy (Contributor, Author)

Tests now pass after I realized I wasn't converting the _FillValue to unsigned.

I also turned off PyNIO's internal support for masking, in keeping with the philosophy that xarray should only use the backends to retrieve the bytes as represented on disk.

Note that some of the CI builds are skipping most of their tests (e.g., py=3.4; you can tell by the run time). This is a problem in other PRs as well.

@jhamman (Member) left a comment

This is looking good. Can you run flake8 again to make sure there aren't any PEP8 violations? I just had one comment of substance.

We intentionally skip some I/O and plotting tests on some of the test matrix so we can test xarray with/without some optional dependencies.

@@ -637,6 +665,13 @@ def maybe_encode_dtype(var, name=None):
'any _FillValue to use for NaNs' % name,
RuntimeWarning, stacklevel=3)
data = duck_array_ops.around(data)[...]
if '_Unsigned' in encoding:
Member

Is _Unsigned = False a valid attribute?

If so, we need this to be if encoding.get('_Unsigned', False).

@deeplycloudy (Contributor, Author) commented Jul 17, 2017

From the documentation I assembled as part of the same issue in NetCDF4-python, it looks like _Unsigned = "true" is present if the data are unsigned, with no mention of _Unsigned = "false". I haven't seen an example of False in the wild, but your suggestion is consistent with the patch to NetCDF4-python for this issue.

I made the change you recommended on the encode side.

On decode there is a corresponding check of the attribute, so I made a change there, too.
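For context, here is a minimal sketch of how a file carrying this convention might be written with netCDF4-python; the file name, variable name, and values are hypothetical and not taken from this PR.

```python
import netCDF4
import numpy as np

# Sketch only: write a signed byte variable and mark it with the NUG-style
# _Unsigned = "true" attribute so CF-aware readers reinterpret it as uint8.
nc = netCDF4.Dataset("unsigned_example.nc", "w")    # hypothetical file name
nc.createDimension("x", 4)
counts = nc.createVariable("counts", "i1", ("x",))  # stored as signed bytes on disk
counts.setncattr("_Unsigned", "true")
counts[:] = np.array([0, 1, 127, -128], dtype="i1")  # -128 should decode to 128
nc.close()
```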

self.assertDatasetIdentical(encoded,
create_encoded_unsigned_masked_scaled_data(
)
)
Member

let's pep8 this:

        self.assertDatasetIdentical(
            encoded, create_encoded_unsigned_masked_scaled_data())

Contributor Author

Done. That's much better.

self.assertDatasetAllClose(encoded, actual)
# make sure roundtrip encoding didn't change the
# original dataset.
self.assertDatasetIdentical(encoded,
create_encoded_masked_and_scaled_data())
create_encoded_masked_and_scaled_data()
)
Member

pep8 / move last ). This whole call should actually fit on one line.

Contributor Author

Done.

@jhamman (Member) left a comment

I think we're good to go here. @shoyer - I'll let you have the final review/merge.

@shoyer (Member) commented Jul 18, 2017 via email

@@ -637,6 +665,13 @@ def maybe_encode_dtype(var, name=None):
'any _FillValue to use for NaNs' % name,
RuntimeWarning, stacklevel=3)
data = duck_array_ops.around(data)[...]
if encoding.get('_Unsigned', False):
unsigned_dtype = 'i%s' % dtype.itemsize
old_fill = np.asarray(attrs['_FillValue'])
Member

This block should be guarded by a check that _FillValue is actually defined as an attribute.
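A minimal sketch of what that guard could look like, written as a standalone helper for clarity; the function name and the exact handling of data and attrs are assumptions, not the code merged here.

```python
import numpy as np


def encode_unsigned(data, attrs, encoding):
    # Hypothetical helper: cast unsigned in-memory data back to the signed
    # dtype used on disk, touching _FillValue only when it is actually present.
    if encoding.get('_Unsigned', False):
        signed_dtype = np.dtype('i%s' % data.dtype.itemsize)
        if '_FillValue' in attrs:
            attrs['_FillValue'] = np.asarray(attrs['_FillValue']).astype(signed_dtype)
        data = data.astype(signed_dtype)
    return data, attrs
```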

@@ -786,6 +822,16 @@ def decode_cf_variable(var, concat_characters=True, mask_and_scale=True,
dimensions = dimensions[:-1]
data = CharToStringArray(data)

pop_to(attributes, encoding, '_Unsigned')
is_unsigned = encoding.get('_Unsigned', False)
if (is_unsigned) and (mask_and_scale is True):
Member

You don't need the extra parentheses here, and use implicit boolean checks instead of is True. So this should be: if is_unsigned and mask_and_scale

Contributor Author

Done.

# Need to convert the fill_value to unsigned, too
# According to the CF spec, the fill value is of the same
# type as its variable, i.e. its storage format on disk
fill_value = np.asarray(fill_value, dtype=data.unsigned_dtype)
Member

Can we simply cast fill_value to the dtype of data unilaterally here? e.g., fill_value = np.asarray(fill_value, dtype=data.dtype) without the if is_unsigned and has_fill check?

That seems a little more robust to me.

Contributor Author

Done. I had to leave the has_fill check because fill_value will be None in cases where there is no fill_value attribute.
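A hedged sketch of the resulting decode-side logic; the helper name is made up, and in the PR the equivalent code lives inline in decode_cf_variable.

```python
import numpy as np


def cast_fill_value(fill_value, data):
    # Sketch: unconditionally cast the fill value to the dtype of the decoded
    # data (unsigned when _Unsigned was set), but only when a _FillValue
    # attribute was present at all.
    has_fill = fill_value is not None
    if has_fill:
        fill_value = np.asarray(fill_value, dtype=data.dtype)
    return fill_value
```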

self.assertDatasetAllClose(decoded, actual)
with self.roundtrip(decoded,
open_kwargs=dict(decode_cf=False)) as actual:
# TODO: this assumes that all roundtrips will first
Member

please remove this redundant TODO note

Contributor Author

Done.

open_kwargs=dict(decode_cf=False)) as actual:
# TODO: this assumes that all roundtrips will first
# encode. Is that something we want to test for?
self.assertDatasetAllClose(encoded, actual)
Member

I'm not sure assertDatasetAllClose checks dtypes. It would be good to check explicitly here.
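One way such an explicit check might look, sketched with plain asserts; the variable name and expected dtype are hypothetical, and the actual test would use the project's own helpers.

```python
import numpy as np


def assert_unsigned_dtype(decoded, name='x', expected=np.uint8):
    # Sketch only: beyond the value comparison, assert that the decoded
    # variable carries the unsigned dtype rather than the signed on-disk one.
    assert decoded[name].dtype == expected, (name, decoded[name].dtype)
```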

Contributor Author

Done.

@shoyer shoyer merged commit e3e6db5 into pydata:master Jul 28, 2017
@shoyer (Member) commented Jul 28, 2017

Thanks @deeplycloudy !

@jhamman jhamman modified the milestone: 0.10 Aug 4, 2017