Authors | Pearu Peterson |
Created | 2021-01-15 |
Currently, PyTorch implements the following multi-dimensional array storage formats:
- Strided: an array is defined by shape, strides, and 1D values (PyTorch internal buffer).
- COO: an array is defined by shape, indices (PyTorch 2D Strided tensor), and values (PyTorch 1D Strided tensor).
Here we do not consider hybrid COO tensors for simplicity.
For matrix multiplication of sparse tensors, the COO storage format is suboptimal and more memory/processor efficient storage formats are needed, especially those that allow parallel processing of array data with minimal cache misses and to execute in streaming fashion.
We propose implementing the following new storage formats in PyTorch:
-
CSR: a 2D array is defined by shape, compressed row indices (PyTorch 1D tensor), column indices (PyTorch 1D tensor), and values (PyTorch 1D tensor).
-
DM: an array is defined by shape, dimensions map, and storage array (an arbitrary PyTorch tensor).
The DM (Dimensions Mapping) storage format is a new concept that constitutes of dimensionality reduction and promotion of the storage array as defined by dimensions map parameters (see the DM storage specification below). The DM storage array is a PyTorch tensor that can use any of the existing storage formats (Strided, COO) or any of the proposed new storage formats: CSR and DM. The dimensionality of the storage array must not be greater than the dimensionality of the DM array.
The main feature of the DM storage format is to provide an one-to-one mapping between the elements of the "wrapper" array and of the "storage" array when the dimensionalities of the arrays are different. Such mapping of the array elements is realized in between the indices sets of the wrapper and storage arrays and can be described in multiple ways. Our description of the mapping is as follows:
-
shape
parameter is a list of wrapper array dimensions. For example, ifshape=(3, 4, 5)
then the wrapper array is a 3D array with the first dimension having size3
, the second4
, and the last5
. -
dimensions
parameter is a permutation list of wrapper array dimensions. For example, ifdimensions=(2, 1, 0)
then the first and the last dimensions are considered swapped when applying the dimensions partitioning as decribed below. -
partitioning
parameter is a list ofdimensions
indices defining the partition of the wrapper array dimensions (after the permutation) to the dimensions of the storage array. The dimensionality of the storage array must be equal tolen(partitioning) + 1
. For example, ifpartitioning=(1,)
anddimensions=(2, 1, 0)
then the dimensionality of the storage array is2
and the last dimension (with size5
) of the wrapper array is mapped to the first dimension and the rest dimensions (with sizes4
and3
) to the second dimension of the storage array, respectively. As a result, the shape of the storage array will be(5, 12)
.
While the dimensions
and partitioning
parameters define the
one-to-one mapping between the elements of the wrapper and storage
array, internally, the DM storage format uses a different set of the
DM storage format parameters: rdimensions
, rstrides
, and roffset
which can be computed from dimensions
and partitioning
parameters
but are more convinient to describe the slicing of DM formatted
arrays. The exact definition of these parameters is implemented in
the source code of the DM format
prototype.
While CSR format supports more performant Linear Algebra operations with 2D sparse arrays (see the benchmark results in PyTorch PR 44190 description), the DM format provides a method of generalizing 2D CSR formatted arrays to multidimensional arrays. In addition, the DM format allows several new perspectives to multidimensional array representation:
-
no-copy slicing and axis swapping for COO and CSR/CSS tensors.
-
hierarchical storage format: DM tensor can use another DM tensor as an storage array.
-
extending storage formats with restricted dimensioniality (e.g. the CSR format is essentially for 2D arrays only) to arbitrary dimensionality.
A pure Python prototype of DM storage format is provided in GCS package where the dimensions mapping and slicing of DM arrays are implemented and demonstrated on wrapping of storage arrays using various storage formats like Strided, COO, FlatCOO, and CSR. See also Gentle Introduction To GCS that exemplifies the dimensions mapping concepts for CSR storage array in more detail.
- Implement CSR storage format, the implementation exists in the form of PyTorch PR 44190 but it requires some clean up.
- Implement DM storage format in storage array format agnostic way.
The PyTorch PR 44190 is kept for reference but will be retired eventually.