Pulling Awkward Arrays from or pushing them to Zarr (and maybe other storage mechanisms) #29
-
I met with the Zarr authors in September to talk about setting up a two-way connection. Their primary focus is genetics, but that just means they need fast access to large arrays like the rest of us. There is a peculiarity to that focus, however: their datasets of interest are mostly flat arrays with a possibly jagged inner dimension. The datasets I'm interested in (from a particle physics perspective) make heavy use of record structures with named fields.

If you have truly flat arrays (no records and no jaggedness), you can get away with a lightweight option. NumPy's npy/npz files are about as fast as you can get because there's no translation between disk and memory. If your data are highly compressible (such that decompression + reading fewer bytes is faster than reading more bytes), you can use npz's built-in zlib compression.

HDF5 is also for big, flat arrays, but it's a heavyweight option with a lot of knobs to tune before it gets efficient. Once it is properly tuned, though, it's what supercomputers use in HPC.

Another lightweight, flat-array option is blosc/bcolz, which does the decompression on demand: it has such a fast decompressor that it stores the arrays in memory in compressed form and decompresses just before each calculation. The idea is that you transfer fewer bytes from main memory to your CPU cache (same idea as the above, but for the memory-to-cache transition rather than the disk-to-memory transition).

The handling of jaggedness in Zarr isn't very efficient yet (and the users in genetics are avoiding it), but that will hopefully change in the upcoming year. ROOT pioneered jaggedness-handling, with records, but now Parquet does the same job. The Arrow library for Python, pyarrow, has the best Parquet reader/writer.

At larger scales, you might consider object stores, each with a column of data in it, though that means more manual work doing conversions if you need to turn the flat arrays into structures.

Awkward 0.x has a protocol for saving full awkward arrays in flat-array (or raw-blob) backends, so you can save awkward arrays in npy/npz files and HDF5. Upgrading that to Awkward 1 isn't a high priority, though.

Another thing that you should know: ROOT is revamping its storage format to take advantage of the new ecosystem. A class named RNTuple will be replacing TTree, and the new RNTuple code is being developed in such a way that it can be used separately from the rest of the ROOT framework. You might be interested in that, though it's not ready for general users yet.
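As a minimal sketch of the lightweight npy/npz option (file names and array contents here are made up for illustration):

```python
import numpy as np

# Three "audio clips" of different lengths, stored as separate flat arrays.
clips = {f"clip{i}": np.random.rand(n) for i, n in enumerate([1000, 2500, 400])}

# npz is a ZIP of npy blobs; each array can be loaded independently.
np.savez("clips.npz", **clips)

# For highly compressible data, trade some CPU for fewer bytes on disk.
np.savez_compressed("clips_compressed.npz", **clips)

loaded = np.load("clips.npz")
print(loaded["clip1"][:10])  # only this one blob is read from the archive
```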
-
Thank you for the elaborate reply and various suggestions. In my case I deal with audio of varying length (most of the time), hence NumPy is too limiting for an all-in-one solution for my purpose. Also, I need lazy loading (sliced access), and it seems that loading an npz file pulls each array fully into memory.

Actually, I already started to build my database with HDF5, since it can store AwkwardArrays, and I assumed this was still possible from 1.x forward (would be good to clarify this in the Awkward 0.x docs?). I'm glad I started looking for other solutions, due to the parallel writing/reading problems of HDF5. Otherwise I would have been stuck with Awkward 0.x :')

I've started experimenting with Parquet now. Although their documentation is all about DataFrames, with AwkwardArray it seems to do the job ^-^ It's unfortunate that Parquet cannot include metadata like HDF5 can, but the integration with Arrow is a nice thing.

This sounds promising! Once this new format is ready, what would be the advantages / disadvantages compared to using Parquet? About Zarr: if my audio data is fixed-length after all (in some cases), would it be faster to store that data in Zarr format for lazy sliced loading, or would Parquet with Awkward be quite similar in performance? From my understanding, NumPy arrays cannot be stored in Parquet, right?
-
The difference between NumPy's npy and npz is that npz is a collection of npy binary blobs in a ZIP file. Each of those binary blobs can be separately loaded. If you have 1-dimensional audio arrays, each can be a different length and separately loaded by putting them all in an npz file. The granularity of lazy loading is one binary blob (array), though there can be many in the file.

Also, if you put uncompressed data into npy files, they can be lazily loaded with finer granularity by opening them with numpy.memmap. This uses the operating system's own virtual memory, which lazily loads in chunks of (probably) 4 kB. This happens whenever you read a file anyway (it's the operating system's cache); memory-mapping lets you access that cache directly, which can be good for sequential access.

Zarr gives you more control over the lazy loading, with some advantages over HDF5. (Maybe parallel reading? I'm hazy on the details.)

You can definitely store flat arrays in Parquet. The pyarrow library has a reader and writer for exactly that.

I don't see the need for jagged arrays in your case, because even though every audio sample is a different length, a single audio sample is simply one-dimensional, right? If your audio samples are reasonably large (kB to MB or more) and you don't have too many of them (thousands, not millions), then you can efficiently work with them using a separate NumPy array for each audio sample. Unless this isn't your case, you might be overthinking it.

The persistence layer for awkward will be reproduced in version 1.0, but not right away. (It's not one of the priority items. We should have old-to-new and new-to-old conversion functions to help with the transition, though.)
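For example, a minimal sketch of both ideas (the file names are placeholders):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

audio = np.random.rand(48000 * 60)  # one minute of mono samples

# Flat NumPy array -> Parquet (as a one-column table) and back.
pq.write_table(pa.table({"samples": audio}), "audio.parquet")
roundtrip = pq.read_table("audio.parquet")["samples"].to_numpy()

# Uncompressed npy -> memory-mapped lazy loading; only the pages
# touched by this slice are actually read from disk.
np.save("audio.npy", audio)
mapped = np.load("audio.npy", mmap_mode="r")
first_10_sec = mapped[: 48000 * 10]
```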
-
Thank you for explaining the difference between npy and npz.

I think I do need JaggedArrays / lazy loading. My audio data is 2D (2 channels), can be 5 min ~ 30 min, and I want to access it database-like. Sometimes I need only channel 0, other times both channels. Sometimes I need the full length, sometimes only the first 10 seconds. Sometimes I need to iterate over all audio samples, sometimes only a selection (e.g. an indexed selection). This is all possible with JaggedArrays (I thought while I was typing this...):

```python
# only channel 0
jarr_2d[:, 0]
# <JaggedArray [[0 1 2] [6 7] [10 11 12 13]] at 0x7f0943cb1710>

# iterate over a selection (arrays 0 and 2)
selection_idx = np.array([0, 2])
for jarr in jarr_2d[selection_idx, :]:
    print(jarr)
    print(type(jarr[:]))
# how do I get a 2D numpy array back though?
# [[0 1 2] [3 4 5]]
# <class 'awkward.array.jagged.JaggedArray'>
# [[10 11 12 13] [14 15 16 17]]
# <class 'awkward.array.jagged.JaggedArray'>

# first 2 values only
jarr_2d[:, 0, :2]
# <JaggedArray [[0 1] [6 7] [10 11]] at 0x7f0943cb76d0>

# PROBLEM: 3D slice not possible
jarr_2d[:, :, :2]
# NotImplementedError: this implementation cannot slice a JaggedArray in more than two dimensions
```

When using .npy, I would need to create many files, and with .npz I would be forced to use a for-loop over the contained arrays.

Somewhat related SO question (fancy indexing): https://stackoverflow.com/questions/59150837/awkward-array-fancy-indexing-with-boolean-mask-along-named-axis/59150838

Seems I misunderstood something while trying it out. Is it possible to store a 2D numpy array (of the same length) without flattening it (because then we cannot easily select channel 0)? That is, not a return like this:

```python
jarr_2d[0, :]
# <JaggedArray [[0 1 2] [3 4 5]] at 0x7f0943d11750>
```

but a return like this (not actual working code):

```python
jarr_2d[0, :]  # not correct
# array([[0, 1, 2],
#        [3, 4, 5]])
```

I would prefer to keep the 2-channel audio as a 2D numpy array.

Edit: Would being able to store 2D np arrays in an AwkwardArray make it impossible to export it to Arrow / Parquet?
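For reference, the only workaround I've found so far goes through Python lists (just a sketch, assuming both channels have the same length; probably not the fastest way):

```python
import numpy as np

# jarr_2d[0] is a JaggedArray holding two equal-length channels;
# tolist() materializes it so NumPy can rebuild a regular 2D array.
regular = np.asarray(jarr_2d[0].tolist())
# array([[0, 1, 2],
#        [3, 4, 5]])
```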
-
See the page labeled "9/16" (page 18 in the PDF) here: https://indico.cern.ch/event/773049/contributions/3473258/

The inability to generalize was exactly why I needed to rewrite it. Each depth of nested slicing was a separate implementation using different NumPy tricks, so I only went up to 2. Now that I can write for loops in C++, I've solved the problem of slicing at arbitrary depths with a recursive function.

Although I like promoting the use of Awkward, I also recommend the simplest solution to a problem. If you have a lot of 2-channel audio, why not save each audio clip in a separate two-dimensional array file (in which the second dimension has length 2)? These "files" may be separately loadable binary blobs in a NumPy npz file. If you want large audio clips to be only partially loaded, then make your database a collection of real files, possibly npy, in a directory and memory-map them. Or use Zarr or HDF5. Or Dask.

I don't see here any intrinsic need for jaggedness. (There are other applications that really need it, and I obviously like finding such cases, but I honestly don't see it here.) There's no reason, for instance, to make all of your samples a single array.

Assuming high-fidelity audio samples for numeric processing, not just listening, such as 2 channels of 64-bit floating-point numbers at 48 kHz, 5 minutes is 220 MB and 30 minutes is 1320 MB (48000 × 60 × #minutes × 2 × 8 / 1024 / 1024). On a computer with multiple GB of RAM, you can load several of these in their entirety at any given time and do computations across them (adding or correlating or whatever). Also, the smallest ones are big enough that if you do a Python for loop over audio samples, you will not be dominated by Python overhead. Your data are already in a sweet spot where each unit is big enough to put into Python variables and small enough to fully load into memory (and if that ever becomes a problem, there's memory-mapping, Zarr, HDF5, or Dask).
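A minimal sketch of the directory-of-npy-files approach (paths and shapes are made up for illustration):

```python
import numpy as np
from pathlib import Path

db = Path("audio_db")
db.mkdir(exist_ok=True)

# One file per clip, shaped (num_samples, 2) for 2-channel audio.
for name, minutes in [("clip_a", 5), ("clip_b", 30)]:
    np.save(db / f"{name}.npy", np.random.rand(48000 * 60 * minutes, 2))

# Memory-map a clip: channel 0 of the first 10 seconds, without
# reading the remaining minutes from disk.
clip = np.load(db / "clip_a.npy", mmap_mode="r")
snippet = clip[: 48000 * 10, 0]
```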
-
Thank you for linking that PDF and the other approaches. It was very informative. I still think AwkwardArray is beneficial in my situation, even for things not directly related to performance (code simplicity/flexibility, Parquet/Arrow support, selecting relevant arrays with numpy arrays instead of strings, improvements in Awkward 1.0 such as 3D+ array slicing, etc.). I didn't fully sketch my current situation, because that is out of the scope of this GitHub issue. I would like to tell you about it in more detail if you're up for it? You can send me an email at the address shown in my profile (https://github.com/NumesSanguis). Then we can arrange an audio/video chat, e.g. with https://zoom.us/ ?
-
I'm willing to accept that there's something about your situation that means you can't use many flat arrays. (Actually, it's a pet peeve of mine when people ignore arguments that situation X is different from situation Y and therefore requires different software. I won't do the same!)

That said, persistence is low on the priority list for Awkward 1. Using a storage format like npy/npz, Zarr, HDF5, Arrow, or Parquet would be a matter of unpacking the Awkward array into its constituents and then re-packing it on the other side. That's pretty simple if the data structure is simple and known. It gets more complicated when you want to do it in general, for any structure, and that's the low-priority item for me.

We can talk by Zoom if you set up the meeting. (Zoom is one that I've never set up, but I've successfully joined meetings.) I'm generally available 8am to 4pm U.S. Central time. We should probably continue by email: pivarski at Princeton (edu).
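For a simple, known structure, that unpack/re-pack round trip might look like this sketch in awkward 0.x (assuming a plain JaggedArray; the file name is a placeholder):

```python
import numpy as np
import awkward  # awkward 0.x

jarr = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# Unpack into the flat constituents that any flat-array backend can store.
np.savez("jagged.npz", offsets=jarr.offsets, content=jarr.content)

# Re-pack on the other side.
stored = np.load("jagged.npz")
rebuilt = awkward.JaggedArray.fromoffsets(stored["offsets"], stored["content"])
```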
-
Currently I'm attempting the (un)packing of the AwkwardArray into Parquet, as you can see here: https://stackoverflow.com/questions/59264202/awkward-array-how-to-get-numpy-array-after-storing-as-parquet-not-bitmasked

I've sent you an email, thank you.
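For what it's worth, the round trip I'm trying looks roughly like this (awkward 0.x, assuming its toparquet/fromparquet helpers; as the SO question notes, reading back yields a bit-masked, chunked structure rather than a plain JaggedArray):

```python
import awkward  # awkward 0.x

jarr = awkward.JaggedArray.fromiter([[1.1, 2.2], [3.3]])
awkward.toparquet("audio.parquet", jarr)

readback = awkward.fromparquet("audio.parquet")
# readback wraps the data in ChunkedArray/BitMaskedArray layers,
# so some unwrapping is needed to get a plain JaggedArray again.
```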
-
Thank you again for helping me out. Since this issue has drifted far from the original question (Zarr support), I'll be closing it. If Zarr support is still wanted, I think it's better to create a new issue for it.
-
I'm researching the best storage format for AwkwardArrays (audio data) and Pandas DataFrames. My main requirement is sliced access to arrays, which HDF5 seems best for.
However, in a multi-process / multi-machine setting, HDF5 is very limited in terms of reading, and especially writing: h5py/h5py#1459
I looked at ROOT, but since I'm not in the HEP field, it is quite a huge install for just I/O access.
It also seems not to be too fast for reading, since people recommend temporary storage in HDF5 / .npy: https://stackoverflow.com/questions/58817554/what-the-fastest-most-memory-efficient-way-of-opening-a-root-ntuple-for-machine
You also mention Zarr as a possible solution.
Is there any intention to support Zarr?
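For context, the sliced access I mean is what h5py offers (a minimal sketch; the file and dataset names are made up):

```python
import h5py
import numpy as np

# Write one 5-minute, 2-channel clip as an HDF5 dataset.
with h5py.File("audio.h5", "w") as f:
    f.create_dataset("clip0", data=np.random.rand(48000 * 60 * 5, 2))

# Slices read only the selected region from disk.
with h5py.File("audio.h5", "r") as f:
    first_10_sec_ch0 = f["clip0"][: 48000 * 10, 0]
```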