
_merge_chunk_array could allocate memory more carefully #2038

Open
rabernat opened this issue Jul 14, 2024 · 5 comments


rabernat commented Jul 14, 2024

I was reviewing the Pipeline and have some observations. A key function is _merge_chunk_array:

def _merge_chunk_array(
    self,
    existing_chunk_array: NDBuffer | None,
    value: NDBuffer,
    out_selection: SelectorTuple,
    chunk_spec: ArraySpec,
    chunk_selection: SelectorTuple,
    drop_axes: tuple[int, ...],
) -> NDBuffer:
    if is_total_slice(chunk_selection, chunk_spec.shape) and value.shape == chunk_spec.shape:
        return value
    if existing_chunk_array is None:
        chunk_array = chunk_spec.prototype.nd_buffer.create(
            shape=chunk_spec.shape,
            dtype=chunk_spec.dtype,
            order=chunk_spec.order,
            fill_value=chunk_spec.fill_value,
        )
    else:
        chunk_array = existing_chunk_array.copy()  # make a writable copy
    if chunk_selection == () or is_scalar(value.as_ndarray_like(), chunk_spec.dtype):
        chunk_value = value
    else:
        chunk_value = value[out_selection]
        # handle missing singleton dimensions
        if drop_axes != ():
            item = tuple(
                None  # equivalent to np.newaxis
                if idx in drop_axes
                else slice(None)
                for idx in range(chunk_spec.ndim)
            )
            chunk_value = chunk_value[item]
    chunk_array[chunk_selection] = chunk_value
    return chunk_array

There are two issues with this implementation:

  1. The if existing_chunk_array is None: code path always allocates a new empty array and then copies data from the original selection into it. Why? This seems like a great opportunity to avoid an extra memory copy. Can't we just use a view of the original data (value) when appropriate? This happens e.g. when writing chunks which are smaller than the selection.
  2. The same code path always allocates a buffer of shape chunk_spec.shape, even for the last chunk of an array, which might be much smaller than the chunk shape.

An example which highlights both of these inefficiencies is as follows:

import numpy as np
import zarr
zarr.array(np.ones(101), chunks=(100,))
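To make the second issue concrete, here is a plain-numpy sketch (not zarr internals; the variable names are illustrative) of what the code above effectively does for this example's last chunk:

```python
import numpy as np

# The 101-element array is split into chunks of 100, so the last chunk
# holds a single element -- yet a full chunk-sized buffer is allocated.
chunk_shape = (100,)              # chunk_spec.shape
value = np.ones(1)                # data destined for the last, partial chunk
chunk_selection = (slice(0, 1),)

# existing_chunk_array is None, so a full-size buffer is created...
chunk_array = np.full(chunk_shape, fill_value=0.0, dtype=value.dtype)
# ...and the single element is copied into it.
chunk_array[chunk_selection] = value

print(chunk_array.nbytes)  # 800 bytes allocated to hold 8 bytes of data
```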
@rabernat

Related: this is_total_slice check will generally fail for the last chunk because the last chunk selection will not be the same size as chunk_spec.shape.

None if is_total_slice(chunk_selection, chunk_spec.shape) else byte_setter,
chunk_spec.prototype,

When writing, this will trigger unnecessary trips to the store to read chunks, even though the data will be completely overwritten.
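For illustration, here is a minimal standalone total-slice check (a sketch, not zarr's actual is_total_slice) showing why the final, partial chunk fails the test:

```python
# A selection is "total" if every slice spans the full chunk extent.
# Sketch only; zarr's real is_total_slice handles more selection types.
def is_total_slice(selection: tuple, shape: tuple) -> bool:
    return all(
        sl == slice(None) or (sl.start in (0, None) and sl.stop == size)
        for sl, size in zip(selection, shape)
    )

chunk_shape = (100,)
# A full chunk: the selection spans all 100 elements.
assert is_total_slice((slice(0, 100),), chunk_shape)
# The last chunk of a length-101 array: only 1 element is selected, so
# the check fails even though every stored byte will be overwritten.
assert not is_total_slice((slice(0, 1),), chunk_shape)
```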


d-v-b commented Jul 15, 2024

Good catch @rabernat! I don't know this code very well, but I have a few high-level additions:

  • is_total_slice has been a source of problems / confusion in the past. I suspect we need aggressive testing of this routine.
  • I wish we had performance tooling that could show the memory / IO load associated with your examples. There must be something in the Python space we could use for this.

@rabernat

Tracemalloc should allow some tracing of allocations.

A good target would be the following: if I create a MemoryStore with uncompressed data, I should be able to get this data all the way to a numpy array without copying any bytes.
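As a stdlib-only sketch of how such a check could work, tracemalloc can distinguish an explicit copy from a zero-copy view (the buffer here is a stand-in for an uncompressed chunk, not zarr's actual store machinery):

```python
import tracemalloc

buf = bytes(1_000_000)  # stand-in for an uncompressed chunk in a store

tracemalloc.start()
view = memoryview(buf)   # zero-copy: no new allocation of the payload
copied = bytes(view)     # explicit copy: roughly 1 MB gets allocated
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Only the explicit copy shows up in the traced peak (~1 MB); the view
# costs just a small wrapper object.
print(peak)
```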

@dcherian

I've used the memory_profiler IPython extension to good effect. More recently, scalene has been getting a lot of attention.


d-v-b commented Jul 15, 2024

Maybe it would make sense to create a dedicated hatch environment with useful benchmarking / performance monitoring tools, so we could do stuff like:

hatch run perf:<invoke memory monitoring tool>
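A hypothetical hatch environment along those lines (the env name, dependencies, and script are illustrative, not an agreed-upon config) might look like:

```toml
# Sketch of a pyproject.toml addition; names are placeholders
[tool.hatch.envs.perf]
extra-dependencies = [
  "memory-profiler",
  "scalene",
]

[tool.hatch.envs.perf.scripts]
profile = "python -m scalene {args}"
```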
