-
I was exploring whether Awkward ragged arrays could save space when pickled, compared to Python lists of lists. I did a little testing and found that the Awkward Array used more space than the Python list-of-lists format. I expected Awkward to be smaller, since it stores the data in a uniform int format, while Python stores each element as an object.
My test showed that the pickled Python list is 1/4 the size of the dump produced from the same data as an Awkward Array. Any ideas on why, or other ways to dump ragged arrays of uniform data elements (int, float) more compactly?
-
The nuance here is that the default integer type in Awkward Array is int64. So, when pickling the Awkward Array, each integer consumes 8 bytes. Meanwhile, Python's pickler knows how to densely pack integers (I think this routine: https://github.com/python/cpython/blob/2d3d9b4461d0e2cb475014868af3c2f241cb6495/Modules/_pickle.c#L2066). As such, particularly for small values, the difference between the two is stark.

If you care about space, then you can just use compression, e.g.

import io
import pickle

import numpy as np
import awkward as ak


def save_compressed(array, file_):
    form, length, container = ak.to_buffers(array)
    # Compress the arrays into a bytestream
    data = io.BytesIO()
    np.savez_compressed(data, **container)
    data.seek(0)
    # Use pickle for general serialisation
    pickle.dump([form, length, data], file_)


def load_compressed(file_):
    form, length, data = pickle.load(file_)
    container = np.load(data)
    return ak.from_buffers(form, length, container)

This is just using DEFLATE compression; a different compressor may produce better results.
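To make the size argument above concrete, here is a minimal standard-library sketch. It is an illustration under assumptions, not a measurement of Awkward Array itself: `array.array("q")` stands in for an int64 buffer (8 bytes per element), the values are artificially small and periodic, and `zlib`, `bz2` and `lzma` are simply the compressors that ship with Python.

```python
import array
import bz2
import lzma
import pickle
import zlib

# Small values (0..255), similar to a tiny hand-written test array.
values = [i % 256 for i in range(10_000)]

# Stand-in for an int64 buffer: array.array("q") uses 8 bytes per element.
int64_bytes = array.array("q", values).tobytes()

# pickle encodes small Python ints with short opcodes rather than 8 bytes each.
list_pickle = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

sizes = {
    "int64 buffer (raw)": len(int64_bytes),
    "pickled list": len(list_pickle),
    "int64 + zlib (DEFLATE)": len(zlib.compress(int64_bytes)),
    "int64 + bz2": len(bz2.compress(int64_bytes)),
    "int64 + lzma": len(lzma.compress(int64_bytes)),
}

for name, size in sizes.items():
    print(f"{name:>24}: {size} bytes")
```

Because this data is periodic it compresses unusually well, so the ranking of compressors here should not be taken as representative of real data.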
-
I see. I was confused because I used a very small and non-random array for testing. If I use an array of random 64-bit integers, then the size of the dump is not much larger than 8 bytes per element. Assuming the data is not actually compressible, this is as good as I could hope for.
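The same observation can be reproduced without Awkward Array using a standard-library sketch. As an assumption for illustration, `array.array("q")` again stands in for an int64 buffer, and the seed and element count are arbitrary choices.

```python
import array
import pickle
import random
import zlib

n = 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Random integers spanning the full signed 64-bit range.
values = [rng.randrange(-2**63, 2**63) for _ in range(n)]

# Fixed-width storage: exactly 8 bytes per element.
raw = array.array("q", values).tobytes()
raw_size = len(raw)

# Pickling the Python list adds per-integer opcode and length bytes.
list_pickle_size = len(pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL))

# Random data has essentially no redundancy for DEFLATE to remove.
compressed_size = len(zlib.compress(raw, 9))

print(f"raw int64 buffer : {raw_size} bytes")
print(f"pickled list     : {list_pickle_size} bytes")
print(f"zlib(raw)        : {compressed_size} bytes")
```

For values this large, the fixed 8-byte layout is smaller than the pickled list, and compression gives essentially no further saving.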