This is not necessarily a bug report about the `PackBits` codec, though there is some buggy/unexpected behavior that I'd like to discuss towards the end. I'm primarily interested in the way the filter can be used for multidimensional arrays and potentially enhanced for this setting. Currently, the filter operates as follows:
1. Normalize the input to get the `bool` data type:

```python
arr = ensure_ndarray(buf).view(bool)
```

2. Flatten the array:

```python
arr = arr.reshape(-1, order='A')
```

3. Determine whether any padding needs to be done, and if so, store the number of bits that need to be padded. Then, call numpy to pack the flattened array:

```python
n = arr.size
n_bytes_packed = n // 8
n_bits_leftover = n % 8
if n_bits_leftover > 0:
    n_bytes_packed += 1

# setup output
enc = np.empty(n_bytes_packed + 1, dtype='u1')

# store how many bits were padded
if n_bits_leftover:
    n_bits_padded = 8 - n_bits_leftover
else:
    n_bits_padded = 0
enc[0] = n_bits_padded

# apply encoding
enc[1:] = np.packbits(arr)
```
However, when decoding, we no longer have any information about the original shape, unless the user passes an `out` array shaped like the original input. This is limiting and makes it hard to deploy the codec more generically. So, as an alternative, I modified the implementation to store not only the number of padded bits, but also the number of dimensions of the original array and the size of each dimension. The new encoding works as follows:
1. Normalize the input array.
2. Store information about its shape / dimensions:
   2.1) Store how many dimensions the array has.
   2.2) If the number of dimensions is greater than 1, then for each dimension:
   2.2.1) Treating the dimension size as a `uint64` integer, pack it into 8 `uint8` integers.
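The resulting header layout can be sketched with `struct` (the shape here is a hypothetical example):

```python
import struct

shape = (10, 20)                  # hypothetical 2-D chunk shape
ndims = len(shape)

header = bytes([0])               # byte 0: number of padded bits (filled in during encoding)
header += bytes([ndims])          # byte 1: number of dimensions
for d in shape:                   # bytes 2..: each dimension size as a uint64
    header += struct.pack('Q', d)

print(len(header))                           # 2 + 8 * ndim = 18
print(struct.unpack('Q', header[2:10])[0])   # 10 -- first dimension recovered
```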
This way, we add at most `2 + 8*ndim` bytes to the flattened array, which should be tiny relative to larger multi-dimensional arrays. If the array is flat, we only add one extra byte compared to the current implementation; if it is multi-dimensional, the overhead is proportional to the number of dimensions. This is what the implementation looks like:
```python
import struct

import numpy as np
import numcodecs.abc
from numcodecs.compat import ensure_ndarray, ndarray_copy


class PackBits(numcodecs.abc.Codec):
    """Codec to pack elements of a boolean array into bits in a uint8 array.

    Examples
    --------
    >>> import numpy as np
    >>> codec = PackBits()
    >>> x = np.array([True, False, False, True], dtype=bool)
    >>> y = codec.encode(x)
    >>> y
    array([  4,   1, 144], dtype=uint8)
    >>> z = codec.decode(y)
    >>> z
    array([ True, False, False,  True])

    Notes
    -----
    The first element of the encoded array stores the number of bits that
    were padded to complete the final byte. The second element stores the
    number of dimensions; for multi-dimensional arrays it is followed by
    each dimension size as a uint64 (8 bytes per dimension).
    """

    codec_id = 'packbits'

    def __init__(self):
        pass
        # self.dim_order = None

    def encode(self, buf):
        # normalise input
        arr = ensure_ndarray(buf).view(bool)

        # ------------------------------------
        # Store array shape:
        # Determine the dimension-related information to store:
        ndims = len(arr.shape)  # The number of dimensions
        # The size of each dimension will be stored as a uint64 number,
        # corresponding to 8 uint8 integers. If the array is flat already,
        # dimension sizes will not be stored.
        if ndims > 1:
            dim_sizes = [int8code for d in arr.shape
                         for int8code in struct.pack('Q', d)]
        else:
            dim_sizes = []

        # ------------------------------------
        # Flatten to simplify implementation
        arr = arr.reshape(-1, order='A')

        # determine size of packed data
        n = arr.size
        n_bytes_packed = n // 8
        n_bits_leftover = n % 8
        if n_bits_leftover > 0:
            n_bytes_packed += 1

        # Setup output
        enc = np.empty(n_bytes_packed + 2 + len(dim_sizes), dtype='u1')

        # Determine how many bits were padded
        if n_bits_leftover:
            n_bits_padded = 8 - n_bits_leftover
        else:
            n_bits_padded = 0

        # Store how many bits were padded
        enc[0] = n_bits_padded
        # Store how many dimensions:
        enc[1] = ndims
        # Store dimension sizes:
        if len(dim_sizes) > 0:
            enc[2:len(dim_sizes) + 2] = dim_sizes

        # Apply encoding
        enc[2 + len(dim_sizes):] = np.packbits(arr)

        return enc

    def decode(self, buf, out=None):
        # normalise input
        enc = ensure_ndarray(buf).view('u1')

        # flatten to simplify implementation
        enc = enc.reshape(-1, order='A')

        # ----------------------------------
        # Figure out dimension / dimension-size information:
        ndims = int(enc[1])
        shapes = []
        # If more than 1-D, extract dimension information:
        if ndims > 1:
            for i in range(ndims):
                shapes.append(
                    struct.unpack('Q', bytes(enc[2 + i * 8:2 + (i + 1) * 8]))[0])

        # ----------------------------------
        # Find out how many bits were padded
        n_bits_padded = int(enc[0])

        # Apply decoding
        dec = np.unpackbits(enc[2 + len(shapes) * 8:])

        # Remove padded bits
        if n_bits_padded:
            dec = dec[:-n_bits_padded]

        # View as boolean array; restore the stored shape if there is one
        # (calling reshape with an empty list would fail for 1-D input)
        dec = dec.view(bool)
        if shapes:
            dec = dec.reshape(shapes)

        # Handle destination
        return ndarray_copy(dec, out)
```
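A condensed, standalone sketch of the same header scheme (hypothetical helper names, not part of numcodecs) shows that the roundtrip now preserves the shape:

```python
import struct
import numpy as np

def pack_with_shape(arr):
    """Encode a boolean array, prepending a padding + shape header."""
    arr = np.ascontiguousarray(arr, dtype=bool)
    ndims = arr.ndim
    dims = b''.join(struct.pack('Q', d) for d in arr.shape) if ndims > 1 else b''
    flat = arr.reshape(-1)
    n_bits_padded = (-flat.size) % 8            # bits np.packbits will zero-pad
    header = bytes([n_bits_padded, ndims]) + dims
    return header + np.packbits(flat).tobytes()

def unpack_with_shape(buf):
    """Decode, recovering the original shape from the header."""
    n_bits_padded, ndims = buf[0], buf[1]
    offset = 2 + (8 * ndims if ndims > 1 else 0)
    dec = np.unpackbits(np.frombuffer(buf[offset:], dtype='u1'))
    if n_bits_padded:
        dec = dec[:-n_bits_padded]
    dec = dec.view(bool)
    if ndims > 1:
        shape = [struct.unpack('Q', buf[2 + 8 * i:2 + 8 * (i + 1)])[0]
                 for i in range(ndims)]
        dec = dec.reshape(shape)
    return dec

x = np.random.default_rng(0).random((5, 7)) > 0.5
z = unpack_with_shape(pack_with_shape(x))
print(z.shape)                  # (5, 7) -- shape survives the roundtrip
```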
Possible improvements over this implementation:
Instead of storing dimension sizes as `uint64`, we can use `uint32`. With this, the number of bytes we'll need to add is `2 + 4*ndim`. I went with `uint64` to be on the safe side and accommodate huge arrays, though we're unlikely to have those in practice (since this filter will apply on chunks of larger arrays anyway).
One thing that would be nice to have is to let the user choose the flattening order, as this can affect the packing and compression quality. We could take a `flatten_order` argument when initializing the codec and use it in the `encode` method.
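To illustrate why the flattening order matters, the same 2-D array packed in 'C' (row-major) versus 'F' (column-major) order produces different bytes, which downstream compressors may handle differently:

```python
import numpy as np

x = np.arange(6).reshape(2, 3) % 2 == 0   # [[ True, False, True], [False, True, False]]
c = x.reshape(-1, order='C')              # row-major walk:    T F T F T F
f = x.reshape(-1, order='F')              # column-major walk: T F F T T F
print(np.packbits(c))                     # [168]
print(np.packbits(f))                     # [152]
```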
When experimenting with this implementation, I noticed some unexpected behavior due to the normalization step, in particular the `.view(bool)` call. I understand that the codec expects `bool` input, but if I pass it anything else, for example an array of 0/1 values with dtype set to `int16` or `int32`, then calling `.view(bool)` on it distorts its shape. This behavior is noted in the numpy `.view()` documentation:
> For `a.view(some_dtype)`, if `some_dtype` has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the last axis of `a` must be contiguous.
So, what happened a few times is that I generated a random array with `np.random.choice(2, size=(10, 20))` and, when passing it to the codec without converting to `bool` first, got unexpectedly shaped arrays. In this case, I'd say we should either raise an error when the input data type is not `bool` (or does not occupy the same number of bytes as `bool`), or alternatively convert to `bool` with a copy.
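The shape distortion is straightforward to demonstrate:

```python
import numpy as np

x = np.random.choice(2, size=(10, 20)).astype(np.int16)
v = x.view(bool)
print(v.shape)        # (10, 40) -- last axis doubled (int16 is 2 bytes per element)

# a safe alternative: an explicit cast, which copies when needed
b = x.astype(bool)
print(b.shape)        # (10, 20)
```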
Version information
Python 3.11.5
numcodecs 0.12.1