authors: Thomas Robitaille, Perry Greenfield, Matt Craig
date-created: 2014 October 2
date-last-revised: 2014 December 17
date-accepted: 2014 December 17
type: Standard Track
status: Accepted
This APE is intended to provide a long-term plan for the astropy.nddata
sub-package. The package has been the subject of continuous debate since the
start of the astropy project, and has changed in scope several times, so this
APE is aimed at agreeing on the scope and future of the sub-package and
providing a well-structured foundation on which future development of
astropy.nddata
can occur.
At the first Astropy coordination meeting in 2011, it was decided that as well
as having a generic table container, it would be useful to have a generic
container for gridded data. The astropy.table
package has since then seen a
large amount of development, and the API has now stabilized. The added value of
the table package compared to using simple Numpy structured arrays is clear -
the Table
class makes it very easy to do common operations on tables such
as adding or removing columns or rows, and reading/writing tables to common
file formats.
On the other hand, NDData
development has stagnated and we have not been
able to converge on a stable API. Part of this is due to the fact that there is
in fact a huge variety of 'n-dimensional datasets' and that there is very
little in common for example between a spectrum and an image, in terms of what
can be done with them. This has prevented the NDData
class from including
much functionality, and in fact we have been adding functionality then removing
it after we have realized that it is not general enough.
An example to illustrate this issue is that of arithmetic - what happens when we add two n-dimensional datasets? Of course, the data values can be added, but what happens to the mask, to the flags, or to the uncertainties? The truth is that there is no general recipe for dealing with this, and that it may be a mistake to try and define it in such a general way.
This APE takes the approach of re-thinking the purpose of NDData and trying to define a scope for the future.
First, why do we actually need a generic data container that can be sub-classed? What is the benefit of this versus simply defining separate base classes for spectra, images, and other types of data? There are several possible answers:
- We want to provide users with a consistent experience across data objects -
that is, the user should know that meta-data is always consistently named
meta
, that a mask can always be accessed withmask
, and that the data can be accessed withdata
. - By providing a common base class which can define a unified I/O interface
which taps into the astropy I/O registry, we can seamlessly make it that
all data objects have
read
andwrite
methods that behave consistently. - We want functions and methods in Astropy to know if an object passed to them
is an n-dimensional data object that can be expected to have specific
attributes (such as
wcs
,mask
, and so on).
In principle, none of these require a base class. We could simply agree on a
standard for data objects that defines what certain attributes should be
called. This is a valid solution, but at the same time, having a base class can
enforce this and factor out some boilerplate code, and as described in
Alternatives, the questions raised here would apply to a base Image
class that was sub-classed as XRayImage
, CCDImage
, and so on.
The first proposal in this APE is to separate a definition of the NDData
interface from concrete realizations of NDData
-like objects by creating an
abstract base class called NDDataBase
.
The proposed NDDataBase
simplifies the current NDData
class to the extreme,
such that it only defines which properties are needed for NDData
.
The NDDataBase
class should not define any arithmetic operations, which are
impossible to generalize.
The following properties should be included in the base class:
data
- the data itself. No restrictions are placed on the type of this data in the ABC. Subclasses could, for example, make it a plain Numpy array, masked Numpy array, an Astropy Quantity, or an h5py data buffer. Subclasses could also require, for example, thatdata
provides ashape
attribute and/or be sliceable in order to 'prove' that it is an n-dimensional data object.mask
- the mask of the data, following the Numpy convention of True meaning masked, and False meaning unmasked. Sub-classes could choose to connect this todata.mask
. Masks do not need to be Numpy arrays, they could be for example 'lazy' masks based on functions that will be evaluated on-the-fly.unit
- the unit of the data values, which will be internally represented as an Astropy Unit. If present, subclasses should try to ensure numerical operations properly take into account and propagate units. Sub-classes could choose to connect this todata.unit
, in which case data should be aQuantity
or behave like it.wcs
- an object that can be used to describe the relationship between input and world coordinates. This can (but does not have to) be an Astropy WCS object. Once the generalized WCS system is in place in Astropy, we will probably require this to be such an object. Subclasses are free to be more restrictive in what they permit for thewcs
object.meta
- a dict-like object that can be used to contain arbitrary metadata. This could be a plain Python dict, an ordered dict, a FITS Header object, and so on, provided that it offers dict-like item access and iteration.uncertainty
-- an object that represents the uncertainty in the data for each element on the array. This APE places no restriction on what type of uncertainty this is (e.g. variance, standard deviation, or posisson count rate), nor does it require the attribute to be set at all (other than defaulting toNone
). It places only one restriction onuncertainty
: it must have an attributeuncertainty_type
, which should be a human-readable string.While not a requirement, the following
uncertainty_type
strings are strongly recommended for common ways of specifying normal distributions:"std"
: ifuncertainty
stores the standard deviation/sigma (either a single value or on a per-pixel basis)."var"
: ifuncertainty
stores the variance (either a single value or on a per-pixel basis)."ivar"
: ifuncertainty
stores the inverse variance (either a single value or on a per-pixel basis).
The only required attribute is data
; all others default to None
if
not initialized or overridden in a subclass.
The NDData
class would be a concrete class that is far less restrictive
than the current NDData
while still providing some basic functionality as
a container. It will place loose restrictions on what data
can be set to
and is described further in Implementation.
The NDData
class would not include methods such as __array__
,
__array_prepare__
, and so on which allow a class to be treated as a Numpy
array. This behavior has been identified as being potentially ambiguous in
the general case because it will depend on the details of e.g. how masks are
handled and also does not make it explicit in what units the data is required.
Specific functionality such as uncertainty handling and arithmetic can be
developed as mix-in classes that can be used by NDData
sub-classes.
Generic slicing capabilities, further described in Implementation, will be
provided as a mixin class called NDSlicing
If a user has a data object such as an image, it would be nice if they can use
functions directly on this image and have them return an image object. At the
same time, we do not want to force people to use special data containers if
they have for example a Numpy array and a WCS object. This raises the question
of whether we should duplicate the API for all functions, to provide one
interface for NDData
subclasses, and one for separate attributes. The
proposal in this APE is that functions should only define a single API that
takes separate keyword arguments for e.g. data
, mask
, and so on, but
that we then provide a way for users to be able to call these functions with
NDData
sub-classes (see Implementation).
NDDataBase
will be implemented as an abstract base class. The only input
validation it will provide is enforcing the existence of an
uncertainty_type
attribute if uncertainty
is not None
, as described
above.
We will not include setters for properties except mask
and uncertainty
because it is ambiguous what the meaning of setting e.g. the unit or WCS after
initialization means: it could either mean to change the unit or WCS, or it
could mean that the user wants to convert the data to this new unit or WCS.
Given this ambiguity, it is safer to not have setters for the core attributes
and this is consistent with e.g. Quantity
.
NDData
will be a concrete subclass of NDDataBase
that provides some logic for handling the setting of data
:
- If the object passed in as
data
has ashape
attribute, is sliceable, and has an__array__
method, so that it can be easily used as a numpy array, thenNDData.data
will be set to that object. - Otherwise,
NDData
will attempt to create anumpy.ndarray
from the inputdata
and use that as the internal representation of the data.
The read
and write
methods will be developed via a mixin class.
This APE suggests adding a mixin class, NDSlicing
, to handle basic
slicing. This could be done by simply having code similar to the following
inside __getitem__
:
def __getitem__(self, slice): new = self.__class__() if self.data is not None: new._data = self.data[slice] if self.mask is not None: new._mask = self.mask[slice] if self.wcs is not None: new._wcs = self.wcs[slice] ...
Note that this is only meant as an illustration of the idea suggested here,
and the final implementation will likely differ from this - but the basic
idea is that the slicing would be delegated to the member attributes. For
example, the WCS class would need to define itself how it should be sliced.
Some attributes (such as meta
) would not necessarily need to be sliceable.
Note the effect of slicing on attributes presumably returns a similar object, e.g., for wcs, it returns a new WCS appropriate to the sliced data attribute.
The arithmetic methods currently in NDData
will be implemented in a mixin
called NDArithmetic
.
One question that has come up as part of several affiliated packages is how
to deal with NDData
objects in functions. For example, if we consider a
downsample
function that can downsample an image, should the function
accept only NDData
(or sub-class) objects? Should it also
accept plain Numpy arrays? If so, how do we pass any additional meta-data
such as WCS? Should we return a downsampled Numpy array and downsampled WCS,
or a single downsampled NDData
(or sub-class) instance? In this example, one option would
be to provide two APIs, one for NDData
and/or sub-classes and one for separate Numpy arrays
and attributes, but maintaining two parallel APIs is not an ideal solution.
An alternative is for each function to encode the logic of checking the input
type and deciding on the output type based on the output type. However, this
means repeating a lot of similar code such as:
def downsample(data, wcs=None) if isinstance(data, NDData): if wcs is not None: raise ValueError("wcs cannot be specified if NDData instance was passed") wcs = data.wcs data = data.data
and this will become a lot more complex once more attributes are needed by the function.
In order to make it easier for functions to accept NDData
sub-classes and
return these, we can implement a decorator that will automatically split up an
NDData
object as needed. Let us consider the following function:
def test(data, wcs=None, unit=None, n_iterations=3): ...
We can provide a decorator called e.g. support_nddata
:
@support_nddata def test(data, wcs=None, unit=None, n_iterations=3): ...
which makes it so that if the user passes an NDData
sub-class called e.g.
nd
, the function would automatically be called with:
test(nd.data, wcs=nd.wcs, unit=nd.unit)
That is, the decorator looks at the signature of the function and checks if any
of the arguments are also properties of the NDData
object, and passes them
as individual arguments.
An error could be raised if an NDData
property is set but the function does
not accept it - for example, if wcs
is set, but the function cannot support
WCS objects, an error would be raised. On the other hand, if an argument in the
function does not exist in the NDData
object or is not set, it is simply
left to its default value. This behavior could be customizable but the
details are beyond the scope of this APE document.
If the function call succeeds, then the decorator will make a new NDData
object (with the correct class) and will populate the properties as needed. In
order to figure out what is returned by the function, the decorator will need
to accept a list which gives the name of the output values:
@support_nddata(returns=['data', 'wcs']) def test(data, wcs=None, unit=None, n_iterations=3): ...
Finally, the decorator could be made to restrict input to specific NDData
sub-classes (and sub-classes of those):
@support_nddata(accepts=CCDImage, returns=['data', 'wcs']) def test(data, wcs=None, unit=None, n_iterations=3): ...
With this decorator, the functions could be seamlessly used either with
separate arguments (e.g. Numpy array and WCS) or with subclasses of
NDData
such as CCDImage
.
The Quantity
class would benefit from the ability to share the same
interface that NDData provides and to tap into NDData's metadata and WCS
handling.
Because subclassing from numpy.ndarray
involves subtleties that differ
from typical subclassing in python, an example subclass called NDQuantity
may be implemented as part of astropy.nddata
. Should it turn out to be
unreasonably difficult to do, an attempt may be made to implement a class
which uses Quantity
as the data store, with the ability to link NDData
properties like unit
to the underlying properties in Quantity
. Should
that also prove to be unworkable, an explanation of the issues that prevented
implementation may be provided in the documentation for NDData
.
Initial decorator implementation: astropy/astropy#2855
Initial refactoring of NDData: astropy/astropy#2905
This APE will require packages such as specutils
and ccdproc
to
completely refactor how they use the NDData
class. This will also break
compatibility with users currently using NDData
directly, but this is
assumed to be a very small fraction (if any) of users.
One alternative is to remove the NDData
class altogether and to start
the base classes at the level of Spectrum
or Image
. In this case many
of this ideas of this APE (including the attribute names, decorators, etc.)
would still apply to these base classes. The benefits of having a base
NDData
class instead of starting at the Image
and Spectrum
level
are that:
- The
NDData
class enforces the naming of the base properties to ensure consistency across all sub-classes. - It allows slicing to be implemented at the core level as a mixin, whereas
this would need to be repeated in each base class if we had e.g.
Spectrum
,Image
,SpectralCube
as the base classes. - It allows the connection to the unified I/O framework to be defined once, whereas this would also need to be repeated in each base class otherwise.
On the other hand, the downsides of having a core NDData
class is that it
reduces flexibility of the sub-classes - for instance Spectrum
has to be
implemented taking into consideration the restrictions on e.g. attribute
names defined by the sub-classes. In the
spectral-cube package, at the moment
we do not have a data
attribute because we have a custom masking
framework and define attributes like unmasked_data
. Of course, we should
aim to make this more compliant with what is decided here, but this is just
to demonstrate that this type of flexibility may be lost. However, this may
be a good thing as it enforces consistency for users.
The original implementation of the NDData
class behaved like a numpy
ndarray
; an alternative to making NDData
a more generic container is
to make it a full-fledged subclass of ndarray
or of Quantity
. The
advantage of this approach is that it potentially reduces duplication of code
by using the infrastructure of Quantity
and/or nddata
.
It has the disadvantage of reducing the flexibility of NDData
and presents
the challenge of handling the attributes (especially meta
, mask
and
wcs
) in a sensible way for arbitrary operations on an NDData
. Even in
one of the most straightforward cases, the addition of two NDData
objects
with metadata, it is unclear what the meta
of the result should be.
There is a need for a more generic container with metadata than would be
possible if subclassing from ndarray
. In addition, it would be
straightforward to implement a subclass of the NDData
proposed in this APE
that ties the unit
and (when they are available in `Quantity
) mask
and uncertainty
to those properties of the data
attribute. In other
words, a subclass which is essentially a Quantity
with meta
wrapped in
the NDData
interface is straightforward.
If NDData
subclasses from ndarray
then it will be difficult or
impossible to subclass a more generic container from it, which is likely to
lead, down the road, to the need for the type of generic container proposed in
this APE.
This APE led to lengthy discussion both on the mailing list and in the astropy-APEs Pull Request. This produced some major changes to the APE, leading to the current form. A final vote was held, with unanimous support. Hence this APE was accepted 2014 December 17.