concatenate_arrays with (slightly) different array shapes #305

Open
ivirshup opened this issue Feb 23, 2023 · 2 comments · May be fixed by #374

@ivirshup
Contributor

I would like to be able to concatenate arrays with different shapes. They should have the same size in every dimension except the one we are concatenating along (i.e. numpy concatenation rules).

Why doesn't concatenate_arrays work in this case right now, and what would it take to work?

The current restriction of "same size" doesn't seem to be working (#304), but also the current documented restrictions don't seem accurate. For example, these arrays have the same size and chunking:

import fsspec
import numpy as np
import zarr

import kerchunk.combine
import kerchunk.zarr

fn1 = "tmp/out1.zarr"
fn2 = "tmp/out2.zarr"
x1 = np.arange(10)
x2 = np.arange(10, 20)
g1 = zarr.open(fn1, "w")
g1.create_dataset("x", data=x1, chunks=(3,))
g2 = zarr.open(fn2, "w")
g2.create_dataset("x", data=x2, chunks=(3,))

However concatenate_arrays produces invalid results:

ref1 = kerchunk.zarr.single_zarr(fn1, inline=0)
ref2 = kerchunk.zarr.single_zarr(fn2, inline=0)

out = kerchunk.combine.concatenate_arrays([ref1, ref2], path="x",)
mapper = fsspec.get_mapper("reference://", fo=out)
g = zarr.open(mapper)

np.array_equal(g["x"][:], np.concatenate([x1, x2]))
# False

g["x"][:]
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  0])

This broadly makes sense to me, due to a chunk boundary not aligning with the end of the first array. Is that resolvable without ZEP 3?

And otherwise, should an additional restriction on concatenate_arrays be: (shape % chunks) == 0 for dimensions being concatenated along?
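
The misalignment described above can be reproduced without kerchunk. What follows is a minimal sketch in plain Python, not kerchunk's actual code. It assumes that each array's final partial chunk is stored padded to full size, and that each later array's chunk keys are offset by `len // chunk` (floor division). Both assumptions, and the names `split_chunks` and `naive_concat`, are hypothetical; they are chosen only because they yield the same values as the output shown above.

```python
def split_chunks(values, chunk, fill=0):
    """Split a 1-D list into fixed-size chunks, padding the last one with `fill`."""
    chunks = []
    for start in range(0, len(values), chunk):
        piece = list(values[start:start + chunk])
        piece += [fill] * (chunk - len(piece))
        chunks.append(piece)
    return chunks


def naive_concat(arrays, chunk, fill=0):
    """Concatenate by renumbering chunk keys (assumed offset: len // chunk)."""
    store = {}
    offset = 0
    total = 0
    for values in arrays:
        for i, piece in enumerate(split_chunks(values, chunk, fill)):
            # A partial trailing chunk of the previous array sits at this
            # key range's start and is silently overwritten here.
            store[offset + i] = piece
        offset += len(values) // chunk
        total += len(values)
    n_chunks = -(-total // chunk)  # ceiling division
    flat = []
    for key in range(n_chunks):
        flat.extend(store.get(key, [fill] * chunk))
    return flat[:total]


x1 = list(range(10))
x2 = list(range(10, 20))
result = naive_concat([x1, x2], chunk=3)
print(result)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0]
```

In this model the value 9 lives in a partial chunk that the second array's first chunk replaces, and the trailing 0 is fill padding. If the first array has 9 elements instead of 10, every chunk is whole and the same function returns the expected concatenation.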

@martindurant
Member

> This broadly makes sense to me, due to a chunk boundary not aligning with the end of the first array. Is that resolvable without ZEP 3?

This is exactly right, and the reason for ZEP3 to exist.

> And otherwise, should an additional restriction on concatenate_arrays be: (shape % chunks) == 0 for dimensions being concatenated along?

Yes, except for the very last array, I suppose. Concatenating files that have whole chunks but a different number of chunks along some concat dimension should be OK. I'm pretty sure we've done this, where the chunk size was 24 (hours per day) but the number of chunks was variable (days per month).
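
The rule discussed here can be written as a small pre-check. This is only a sketch under stated assumptions: `can_concatenate` is a hypothetical helper, not part of kerchunk, and it assumes all inputs share the same chunk shape. It requires matching sizes on the non-concat dimensions and whole chunks along the concat axis for every array except the last.

```python
def can_concatenate(shapes, chunks, axis=0):
    """Check whether arrays can be concatenated along `axis` by chunk renumbering.

    shapes: list of shape tuples, in concatenation order
    chunks: chunk shape shared by all arrays (assumed identical)
    """
    if not shapes:
        return False
    ndim = len(chunks)
    if any(len(shape) != ndim for shape in shapes):
        return False
    # Every dimension other than the concat axis must match exactly.
    first = shapes[0]
    for shape in shapes:
        for dim in range(ndim):
            if dim != axis and shape[dim] != first[dim]:
                return False
    # All arrays except the last must end on a chunk boundary along `axis`.
    return all(shape[axis] % chunks[axis] == 0 for shape in shapes[:-1])


# The failing case from this issue: 10 is not a multiple of 3.
print(can_concatenate([(10,), (10,)], (3,)))
# Hourly data chunked by day, with a different number of days per month.
print(can_concatenate([(24 * 31,), (24 * 28,), (24 * 31,)], (24,)))
```

The first call returns `False` and the second returns `True`: the number of chunks may vary between inputs as long as each non-final input holds only whole chunks along the concat axis.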

@ivirshup
Contributor Author

ivirshup commented Mar 6, 2023

> > should an additional restriction on concatenate_arrays be: (shape % chunks) == 0 for dimensions being concatenated along?
>
> Yes, except the very last one, I suppose.

This case will be allowed by #312
