Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iteration over Zarr arrays is significantly slower in v3 #2529

Open
tomwhite opened this issue Dec 3, 2024 · 14 comments
Open

Iteration over Zarr arrays is significantly slower in v3 #2529

tomwhite opened this issue Dec 3, 2024 · 14 comments
Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)
Milestone

Comments

@tomwhite
Copy link
Contributor

tomwhite commented Dec 3, 2024

Zarr version

3.0.0b2

Numcodecs version

0.14.0

Python Version

3.11

Operating System

Mac

Installation

pip

Description

Iterating over elements or slices of a Zarr array z using iter(z) is much slower in v3 than v2.

In v2 Array implements __iter__, which caches chunks, whereas in v3 Array does not implement __iter__ so iter(z) falls back to calling z[0], z[1], etc, which means that every element or slice has to load the chunk again.

This seems like a regression, but perhaps there was a reason for not implementing __iter__ on Array? The thing is that code will still work in v3, but will be a lot slower (possibly orders of magnitude), and it's quite hard to track down what is happening.

(There's an interesting Array API issue about iterating over arrays which may be relevant here: data-apis/array-api#818.)

Steps to reproduce

See sgkit-dev/bio2zarr#288 (comment)

Additional output

No response

@tomwhite tomwhite added the bug Potential issues with the zarr-python library label Dec 3, 2024
@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

I think we just haven't implemented __iter__ in v3, I don't recall any explicit opposition to it.

I do think this kind of iteration is intrinsically inefficient for most zarr arrays, because it doesn't take any account of the chunk structure. Even if __iter__ caches a single chunk at a time, it might fetch + decompress the same chunk multiple times. To minimize IO, __iter__ should cache all the chunks that intersect with the spatial index, but this is not safe in general for large arrays.

The most efficient iteration would be to iterate over chunks, then iterate over each chunk. This of course is not possible if we choose to emulate numpy-style iteration.

@jeromekelleher
Copy link
Member

jeromekelleher commented Dec 3, 2024

I do think this kind of iteration is intrinsically inefficient for most zarr arrays, because it doesn't take any account of the chunk structure. Even if iter caches a single chunk at a time, it might fetch + decompress the same chunk multiple times.

I don't follow this - if it iterates over the chunks in the first dimension, then that's about as good as you can do, given the definition of the operation? Something like

def iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a

(Just FYI, I added the iter functionality to zarr 2 and find it a useful operation)

@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

Correct me if I'm misunderstanding something -- suppose we have an array a with shape (8, 7) and chunks sized (7, 1). a has 8 total chunks. a[0] will return an array with length 7 comprised of the 0th element of all 8 chunks, and similarly for a[1] etc. If during iteration only 1 chunk is cached at a time, then the each chunk must be read + decompressed each time a[idx] is invoked. Of course if a had chunks sized (1, 7), then caching each chunk during iteration would be quite efficient.

The basic problem is that the IO-optimal way of iterating over zarr arrays requires knowledge of how they are chunked, but the behavior of __iter__ we will implement is fundementally a numpy convention that assumes that fetching data from memory is cheap, and so it is completely ignorant of the chunk layout, which is fine for numpy because numpy arrays are in memory and people don't need to worry too much about the memory layout. The latter does not hold for zarr.

(edit I had transposed my efficient and inefficient chunks)

@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

to illustrate with some examples from zarr==2.18:

this requires reading every chunk per iteration:

>>> [x for x in zarr.zeros((3,2), chunks=(3,1))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]

this requires reading 1 chunk per iteration:

>>> [x for x in zarr.zeros((3,2), chunks=(1,2))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]

for isotropic chunk sizes, some caching is better than none, but this kind of iteration will still be inefficient compared to a chunk-aware iteration API

@jeromekelleher
Copy link
Member

Sorry, I don't get it (I must be missing something basic here!)

def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a

Surely the numpy array A here is the cached result of each chunk, in turn, of the first dimension of z. It makes no difference how the subsequent dimensions are chunked, right? And, given that you must keep the first dimension cached here to have anything like reasonable performance, then this is basically optimal (assuming you want to iterate over every "row" in the first dimension).

So this is chunk aware, right?

@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

Sorry, I don't get it (I must be missing something basic here!)

def first_dim_iter(z):
for chunk in range(z.cdata_shape[0]):
A = z.blocks[chunk]
for a in A:
yield a

Surely the numpy array A here is the cached result of each chunk, in turn, of the first dimension of z. It makes no difference how the subsequent dimensions are chunked, right? And, given that you must keep the first dimension cached here to have anything like reasonable performance, then this is basically optimal (assuming you want to iterate over every "row" in the first dimension).

So this is chunk aware, right?

That function will potentially load an entire array into memory (see the first case), which we do not want:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr==2.18",
# ]
# ///
import zarr
import numpy as np

def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        print('A has shape ', A.shape)
        for a in A:
            yield a

shape = (8,7)
chunks = ((8,1), (1, 7), (3,3))
for _chunks in chunks:
    print('chunks', _chunks)
    z = zarr.zeros(shape=shape, chunks=_chunks)
    z[:] = np.arange(np.prod(shape)).reshape(*shape)
    gen = first_dim_iter(z)
    print(next(gen))
    print(next(gen))

produces

chunks (8, 1)
A has shape  (8, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7.  8.  9. 10. 11. 12. 13.]
chunks (1, 7)
A has shape  (1, 7)
[0. 1. 2. 3. 4. 5. 6.]
A has shape  (1, 7)
[ 7.  8.  9. 10. 11. 12. 13.]
chunks (3, 3)
A has shape  (3, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7.  8.  9. 10. 11. 12. 13.]

@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

it looks like the implementation of Array.iter in v2 had the same behavior -- in the worst case of misalignment between the chunks and the iteration, the entire array might be loaded into memory.

Should we add this to v3 and treat it like a "user beware" situation where people should know how the array is chunked before they iterate over it? Personally, I would be unpleasantly surprised if I tried to iterate of a chunked array and incidentally fetched all the chunks (but I work with TB sized data.). I would generally assume that a "big data first" library would keep surprises like that to a minimum.

More broadly, I'm not a fan of copying numpy APIs like __iter__ that implicitly rely on arrays existing in memory, since Zarr arrays are very much not implicitly in memory.

@jeromekelleher
Copy link
Member

That function will potentially load an entire array into memory (see the first case), which we do not want:

Sorry I'm sure I'm being thick here, but I'm still not getting it. You could only load the entire array into memory if there was only one chunk in the first dimension, right? Having a single chunk in the first dimension of a large array seems quite an odd situation to me? I'm used to working with arrays with a large first dimension, with many chunks, and this pattern works well.

I work with TB scale data as well and I entirely agree that unpleasant surprises like loading an entire array into memory behind the scenes shouldn't happen. But surprises like

for a in z:
     # do something with a

decompressing each chunk, chunksize times are also not entirely pleasant.

@rabernat
Copy link
Contributor

rabernat commented Dec 3, 2024

Having a single chunk in the first dimension of a large array seems quite an odd situation to me?

Let me assure you that this is extremely common. Zarr users exercise every possible allowed permutation of chunk shapes to satisfy different data access patterns. There is nothing special about the first dimension in any way.

@jeromekelleher
Copy link
Member

Thanks @rabernat, that's why I'm confused then!

But in this one chunk case, isn't the whole array getting loaded into memory anyway each time we access a single element, during iteration?

@d-v-b
Copy link
Contributor

d-v-b commented Dec 3, 2024

But in this one chunk case, isn't the whole array getting loaded into memory anyway each time we access a single element, during iteration?

In the worst case, the entire array is loaded into memory just once (on the first invocation of next(array.__iter__), and for each subsequent iteration that in-memory copy of the array is iterated over. In the best case, each chunk is loaded just once, exactly when it's needed.

I think the following example illustrates it. I'm using a subclass of MemoryStore that prints whenever __getitem__ is invoked to indicate where IO would be happening:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "zarr==2.18",
# ]
# ///

import zarr
import numpy as np
class LoudStore(zarr.MemoryStore):
    def __getitem__(self, key):
        print(f"reading {key}")
        return super().__getitem__(key)

shape = (8,7)
chunks = ((shape[0], 1), (1, shape[1]))
for _chunks in chunks:
    print(f'\nchunks: {_chunks}')
    z = zarr.zeros(shape=shape, chunks=_chunks, store=LoudStore())
    z[:] = np.arange(z.size).reshape(shape)
    gen = z.islice()
    for it in range(2):
        print(f'iter {it}:')
        print(next(gen))

which produces the following output:

chunks: (8, 1)
reading .zarray
iter 0:
reading 0.0
reading 0.1
reading 0.2
reading 0.3
reading 0.4
reading 0.5
reading 0.6
[0. 1. 2. 3. 4. 5. 6.]
iter 1:
[ 7.  8.  9. 10. 11. 12. 13.]

chunks: (1, 7)
reading .zarray
iter 0:
reading 0.0
[0. 1. 2. 3. 4. 5. 6.]
iter 1:
reading 1.0
[ 7.  8.  9. 10. 11. 12. 13.]

@tomwhite
Copy link
Contributor Author

tomwhite commented Dec 4, 2024

@d-v-b in v3 chunks are not cached though, so it's much slower, which is why I opened this issue. This is the worse of both worlds as iteration works, but is terribly inefficient.

I think there's a case for implementing __iter__ for 1-D arrays since that is unambiguous, but raising an exception for higher-dimensional arrays (see data-apis/array-api#821). There could perhaps be some utility methods for iterating over Zarr arrays, like Jerome's first_dim_iter or similar.

@dstansby dstansby added performance Potential issues with Zarr performance (I/O, memory, etc.) and removed bug Potential issues with the zarr-python library labels Dec 7, 2024
@dcherian
Copy link
Contributor

dcherian commented Dec 13, 2024

If you care about the order in which elements come out this iteration is only cheap is if there is a single chunk along the last dimension (for order='C') or the first dimension (for order='F').

Anything else will require some handling of potentially large slices & complicated chunk caches whose behaviour depends on the chunking of the array

(I like this image from the dask docs that illustrates the problem).

So perhaps we bring it back but just add the chunksize checks to avoid potentially large memory usage.


It is always possible to just iterate through blocks, and flatten that and yield elements as long as you do not care about order. Dask just added this recently and it can be quite useful.

@tomwhite
Copy link
Contributor Author

So perhaps we bring it back but just add the chunksize checks to avoid potentially large memory usage.

That would be good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)
Projects
None yet
Development

No branches or pull requests

6 participants