Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Managing fill values within the parameter function framework #177

Open
lukecampbell opened this issue May 29, 2014 · 23 comments
Open

Managing fill values within the parameter function framework #177

lukecampbell opened this issue May 29, 2014 · 23 comments

Comments

@lukecampbell
Copy link
Member

The functions and or the function authors aren't responsible for managing fill values, so if fill values are used as inputs incorrect outputs will be produced. The coverage clients aren't responsible either.

It would be good if we can add in correct fill value support within the coverage model.
http://nbviewer.ipython.org/github/lukecampbell/notebooks/blob/master/Masked%20Arrays.ipynb

@caseybryant
Copy link
Contributor

I would actually argue that function authors are responsible for managing fill values. That's why we have an external interface to retrieve a parameter's fill value.

Coverage model should be responsible for storing and regurgitating data, not making assumptions on what the user wants to do with it. The example listed above actually demonstrates why I believe that to be crucial. Observations 1, 5, 9, 10, 14, 15, and 20 each have a fill for either temperature or pressure but not both.
A function writer may want to do calculations if only temperature is valid, but not pressure. Another function writer may want to do calculations only if both temperature and pressure are valid. By leaving it up to the function writer to decide what to do with the data, we allow the function writers the greatest flexibility.

@lukecampbell
Copy link
Member Author

I think it's unreasonable to ask them that.

  1. The functions have no idea indication what valid values are, or rather what consists of a fill value, it changes from one parameter to the next.
  2. These functions are just math functions, they have no real connection to OOI, they can be used for anything in general ocean sciences and are not specific to the parameter functions

I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.

However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.

@caseybryant
Copy link
Contributor

So how do I manage that? Remove every index that has fill values? It the case of your array, all values (time, temperature, and pressure) would have to be masked at every index where any of the parameters have fill values.

Valid data is masked because a function parameter might not process it correctly. Are you suggesting yet another argument flag for the get_parameter_values method?

On Jun 2, 2014, at 10:49 AM, Luke Campbell [email protected] wrote:

I think it's unreasonable to ask them that.

The functions have no idea indication what valid values are, or rather what consists of a fill value, it changes from one parameter to the next.
These functions are just math functions, they have no real connection to OOI, they can be used for anything in general ocean sciences and are not specific to the parameter functions
I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.

However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.


Reply to this email directly or view it on GitHub.

@lukecampbell
Copy link
Member Author

Something like this maybe

def get_wrapper(parameter, time_bounds, fill_value_mask=True):
    ptype = self.get_parameter_context(parameter).param_type
    if not isinstance(ptype, ParameterFunctionType):
        return

    args, mask = build_arg_map(parameter, time_bounds, fill_value_mask)

    # Get the base data type and shape
    value_encoding = ptype.value_encoding # I think it's here
    shape = self.get_values('time', time_bounds).shape

    retval = np.empty(shape, dtype=value_encoding)
    if fill_value_mask:
        retval[mask] = ptype.function._callable(*args)
    else:
        retval[:] = ptype.function._callable(*args)



def build_arg_map(parameter, time_bounds, fill_value_mask=True):
    ptype = self.get_parameter_context(parameter).param_type

    arg_list = ptype.function.arg_list
    arg_map = ptype.function.param_map

    array_map = {}
    mask = None
    for k,v in arg_map.iteritems():
        context = self.get_parameter_context(k)
        array_value = self.get_values(k, time_bounds)
        if mask is None:
            mask = ~np.isclose(array_value, context.fill_value, equal_nan=True)
        else:
            mask = mask & ~np.isclose(array_value, context.fill_value, equal_nan=True)

        array_map[k] = array_value

    if fill_value_mask:
        for k,v in array_map.iteritems():
            array_map[k] = v[mask]

    return array_map, mask

@caseybryant
Copy link
Contributor

Unless I'm missing something, the above appears to be just how to do the masking. My question/statement is more along these lines:
get_parameter_values is currently THE interface for retrieving data. If a parameter function is requested in get_parameter_values, in order to keep alignment and eliminate bad data for the parameter function, I will have to throw away all indexes for which a fill value exists for ANY parameter - because I don't have insight into which parameters a parameter function might use. This could result in valid or important data being dropped from the array returned from get_parameter_values. Are we okay with that, or is there another solution?

@lukecampbell
Copy link
Member Author

With what I've pasted, it doesn't throw away indices. It creates an array in-memory initially of all fill values. Using the mask it fills in the array.

The key here is that we never call the function with an array that contains missing values. It doesn't treat those missing values in any special way

in a simple scenario we have a parameter function sum:

def sum(x,y):
    return x + y

In our coverage, if x is [-9999, 1, 2, 3] and y is [10, 11, 12, 13]
the output of sum should be [-9999, 1, 2, 3] not [-9989, 12, 13, 14]

so the full return dictionary would be

{
  'time' : [t0, t1, t2, t3],
  'x' : array([-9999., 1., 2., 3.], dtype=float32),
  'y' : array([10., 11., 12., 13.], dtype=float32),
  'sum' : array([-9999., 12., 13., 14.], dtype=float32)
}

Behind the scenes, sum only ever got called like this:

sum(x=np.array([1., 2., 3.]), y=np.array([11., 12., 13.]))

@caseybryant
Copy link
Contributor

Okay. I see, now, what you are requesting. Really, this is just - if any of the args have a fill-value at an index, insert the function parameter's fill-value at that index for the function parameter array.

I'm still not a parameter function expert. Can you create some tests that exercise the faulty functionality for both mathematical expressions and python function types so I can figure out the best approach? It doesn't need to be right away. I won't get to this for a few days.

@lukecampbell
Copy link
Member Author

Yeah, I'll make some use-case tests.

@caseybryant
Copy link
Contributor

I will add a new method, get_valid_indexes(), to NumpyDictParameterData - the object returned from Coverage.get_parameter_data(). NumpyDictParameterData.get_data() will continue to return a dictionary of data with fill values. get_fill_index_dict() will return a dictionary of numpy boolean arrays used to indicate if the value at an index of the data arrays is valid or if it fill.

It should be up to the parameter function to use this information to perform calculations at valid indexes only. The array returned from the parameter function should be the the same size as the array returned from the referenced parameter array and determine where fill values should be placed.

Example usage:

ndpd = cov.get_parameter_values(['cond'], fill_indexes=True, ...)
data = ndpd.get_data()['cond']
valid_indexes = ndpd.get_valid_indexes()['cond']
calculated_cond_array = np.empty(data.size, dtype=data.dtype)
calculated_cond_array[:] = fill_value
calculated_cond_array[valid_indexes] = data[valid_indexes]*2+1

@lukecampbell
Copy link
Member Author

This looks good, could you push something like this so I can test it out?

@caseybryant
Copy link
Contributor

In order to be precise in the fill value mapping, I need to know which indexes returned from a function parameter evaluation are fills. This means that the interface needs to be changed all the way through so the actual function being called can create and return the mapping.
@lukecampbell, What are your preferences on this?

@lukecampbell
Copy link
Member Author

I was under the impression that we would filter out the indexes that are fills before we execute the function(s). Parameter functions shouldn't produce fill values on their own, I think.

@caseybryant
Copy link
Contributor

I’m talking about the abstract case where a parameter function receives non-fill data but determines that an index should be a fill.
Are you saying that in all real cases, a parameter function will never return fill?

@lukecampbell
Copy link
Member Author

That's correct. To my knowledge. A fill value in it's purest sense represents a lack of information from a sensor. Whereas in the parameter function's case, it always has information about a domain OR in the case where it doesn't we use NaN, but not as a fill value but in the absence of real value, such as a division by zero or where the mapping of the domain does not result in a real result (finite domain problems like tangent).

@caseybryant
Copy link
Contributor

Before evaluating a parameter function, you'd like all fill values removed. I interpret this as:
Remove ALL observations where ANY of the parameters in the parameter function's param map has a fill.

Is that your expectation?

@lukecampbell
Copy link
Member Author

I believe that's accurate, yes.

@caseybryant
Copy link
Contributor

Hiding data is a restriction on a parameter function's ability to make decisions. I believe you're okay with this, but I want to make sure we are all on the same page.

@lukecampbell
Copy link
Member Author

Just to clarify, this behavior should be optional I think.

@caseybryant
Copy link
Contributor

I can make it optional to hide data.
In the non-optional case, should we pass fill masks to the parameter function, or should we force the function to call get_parameter_values to get the fill mask?

@lukecampbell
Copy link
Member Author

I don't think either are an option. Most parameter functions, that I'm aware of, can't or don't accept callbacks to the coverage model and they don't explicitly accept masked arrays or another parameter to identify the fill value. Some might but most don't, I think. If the option to mask the fill values is not enabled, I would say we pass the arrays as-is into the parameter functions.

I can't think of another option, but I'm open to suggestions.

@lukecampbell
Copy link
Member Author

I think maybe the use case got lost somewhere in this thread.

If we have a sensor that is observing u and v vectors and we have a function that determines wind direction given the u and v vectors, the function is not meant to understand the data model or even the concept of flagged values. It's only intended purpose, architecturally, is to compute wind direction given u and v.

def wind_direction(u,v):
    return 180. / np.pi * np.atan2(u,v)

Now if we have maybe the u-component of the velocity sensor on the instrument fail or go offline for whatever, the u vector will start publishing fill_values (or nothing at all which we then represent as fill_values).

Essentially my goal is this:

u = np.array([  1, 0, 2, -9999, -9999], dtype=np.float32)
v = np.array([0.5, 1, 2,     1,     0], dtype=np.float32)
wind_direction_var = np.ones(u.shape[0]) * -9999
not_fill = (u != -9999) & (v != -9999) 
wind_direction_var[not_fill] = wind_direction(u[not_fill], v[not_fill])

It's not my goal to hide data, I'm just not trying to run a function with missing information which would result in a resultant array with improper information. Instead of the result being the intended

array([63.43495178, 0., 45., -9999., -9999.])

without dealing with the fill values the result would be

array([63.43495178, 0., 45., -89.99427032, -90.])

Which are real values, and it would not be desired, I think.

@caseybryant
Copy link
Contributor

The intent hasn't been lost. I understand your use case. I am writing a library that should be able to support multiple use cases. I am just trying to define the library's restrictions when it comes to parameter functions.

@lukecampbell
Copy link
Member Author

For now, parameter functions are pretty simple, they can't interface with the coverage model and they don't deal with fill values directly (the only exceptions I'm aware of are QC, but that's because I'm the author and they're not used through the coverage model).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants