Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Sharding Support #877

Closed
jstriebel opened this issue Nov 18, 2021 · 29 comments
Closed

Add Sharding Support #877

jstriebel opened this issue Nov 18, 2021 · 29 comments

Comments

@jstriebel
Copy link
Member

jstriebel commented Nov 18, 2021

Update 2023-11-19:

Sharding is now formalized as a codec for Zarr v3 via the Zarr Enhancement Proposal (ZEP) 2.

A new efficient implementation of sharding is being discussed as part of #1569.


TLDR: We want to add sharding to zarr. There are some PRs as starting points for further discussions, as well as issue zarr-developers/zarr-specs#127 to update the spec.

Please see #877 (comment) for a comparison of the implementation approaches.


Currently, zarr maps one array chunk to one storage key, e.g. one file for the DirectoryStore. It would be great to decouple the concept of chunks (e.g. one compressible unit) and storage keys, since the storage might be optimized for larger data per entry and might have an upper limit for the number of entries, such as the file block size and maximum inode number for disk-storage. This does not necessarily fit the access patterns of the data, so chunks might need to be smaller than one storage key.

For those reasons, we would like to add sharding support to zarr. One shard corresponds to one storage key, but can contain multiple chunks:
sharding

We, this is scalable minds, will provide a PR for initial sharding support in zarr-python. This work is funded by the CZI through the EOSS program.

We see the following requirements to implement sharding support:

  1. Add shard abstraction (multiple chunks in a single file/storage object)
  2. Chunks should continue to be compressible (blosc, lz4, zstd, …)
  3. Chunks should be stored in a continuous byte stream within the shard file
  4. Shard abstraction should be transparent when reading/writing
  5. Support arbitrary array sizes (not necessarily chunk/shard aligned)
  6. Support arbitrary number of dimensions
  7. Possibly store chunks in Morton-order within a shard file

With this issue and an accompanying prototype PR we want to start a discussion about possible implementation approaches. Currently, we see four approaches, which have different pros and cons:

  1. Implement shard abstraction as a storage handler wrapper
    • Pro: Can combine logical chunks into single shard keys, which are mapped to chunks of an underlying storage handler. Translation of chunks to shard only needs to happen within the minimal API of the storage handler.
    • Pro: Partial reads and writes to the underlying store are encapsulated in a single place.
    • Con: Sharding should be configured and persisted per array rather than per store.
  2. Implement shard abstraction as a compressor wrapper (current chunks = shards which contain subchunks).
    • Pro: The current notion of chunks as a storage-key stays unchanged in the storage layer. They just correspond to what we normally call "shards" then.
    • Con: Partial reads and writes per storage-key need to be passed through all intermediate layers in the array implementation.
    • Con: Addressing a sub-chunk is not intended. This would need to change the concept that a chunk is usually read or written as a whole. This is already partially the case for blosc compression, but still asserts that compression handles data of the size of a chunk, even if not all data is needed. This approach would break this assertion and needs significant changes in the data retrieval of the array implementation.
  3. Implement shard abstraction via a translation layer on the array
    • Pro: Fits the conceptual level where sharding is configured and persisted per array.
    • Con: Adding sharding logic to the array make the array implementation even more complex. Chunk/Shard-translation and partial reads/writes would need to be added at multiple points.
  4. Implement blosc2 into Zarr (special case of 2)
    • Pro: Sharding is already implemented in the compression.
    • Con: Still, partial read and writes need to happen efficiently throughout the array implementation, see cons of approach 2.
    • Con: One core feature of zarr is the decoupling of the manipulation API, storage layer and compression. This approach tightly couples a single compression with sharding, which rather concerns storage handling. It is not possible to use sharding with other compressions then.

Based on this assessment, we currently favor a combination of approach 1 and 3. This keeps the pros of implementing sharding as a wrapper to the storage handler, which is a nice abstraction level. However, sharding should be configured and persisted per array rather than with the store config, so we propose to use a ShardingStore only internally, whereas user-facing configuration happens on the array, e.g. similar to the shape.

To make further discussions more concrete and tangible, I added an initial prototype for this implementation approach: #876. This prototype still lacks a number of features which are noted in the PR itself, but contains clear paths to tackle those.

We invite all interested parties to discuss our initial proposals and assessment, as well as the initial draft PR. Based on the discussion and design we will implement the sharding support, as well as add automated tests and documentation for the new feature.

@d-v-b
Copy link
Contributor

d-v-b commented Nov 18, 2021

This is great! Can you give (or direct me to) a high-level description of how the scalable minds sharded format differs from the neuroglancer precomputed format? My lay understanding is that the scalable minds format uses a cartesian grid of shards each containing morton-coded chunks, whereas the neuroglancer scheme entails morton coding the chunks first, then partitioning the morton curve-traversed-chunks into shard files. Is this accurate? What are the pros / cons of these two approaches?

cc @jbms

@jbms
Copy link

jbms commented Nov 18, 2021

The neuroglancer precomputed format is rather hacky.

  • It is desirable to be able to choose the shard size to be any multiple of the base chunk size, but the neuroglancer precomputed format does not allow that due to the way shards are determined.
  • In general if your underlying storage system supports key range queries (e.g. S3, GCS, but not a regular filesystem) then it is desirable for your keys to be ordered in a way that matches access patterns. That often means morton or similar order, though it would ideally be configurable to accommodate different intended access patterns. I think in principle the actual format of the storage keys can be considered out-of-scope for the zarr spec itself, as users can define their own mapping implementation that translates the key. But in practice, as we saw with dimension_separator, it would be better for the zarr metadata itself to fully specify how the keys are formatted.

@jakirkham
Copy link
Member

cc @FrancescAlted @joshmoore

@jbms
Copy link

jbms commented Nov 18, 2021

cc @laramiel

@shoyer
Copy link
Contributor

shoyer commented Nov 19, 2021

I think sharding is could make a lot of sense, but it definitely needs to be exposed in the array API. Otherwise, one might inadvertently attempt to simultaneously write different chunks in the same shard which could lead to data corruption errors.

A related concept, which I believe Neuroglancer also supports, is using hashes to group together shards into different directories:

Together with both of these tricks I think we could achieve good scalability even to PB scale arrays.

@FrancescAlted
Copy link

We see the following requirements to implement sharding support:

  1. Add shard abstraction (multiple chunks in a single file/storage object)
  2. Chunks should continue to be compressible (blosc, lz4, zstd, …)
  3. Chunks should be stored in a continuous byte stream within the shard file
  4. Shard abstraction should be transparent when reading/writing
  5. Support arbitrary array sizes (not necessarily chunk/shard aligned)
  6. Support arbitrary number of dimensions
  7. Possibly store chunks in Morton-order within a shard file

Good to see that you are tackling this. It is also a shame that you don't find the adoption of blosc2/caterva appealing because IMO it currently fulfills most of the requirements above. But I understand that you have your own roadmap and it is fine if you prefer going with a different solution :-)

@joshmoore
Copy link
Member

Hi @FrancescAlted. We're not nearly far enough along to say that blosc2/caterva aren't appealing! 😄 (see approach #4: "Implement blosc2 into Zarr") What we realized after chatting with you was that we really didn't know enough to be able to make any of the critical decisions. The goal here is to get a prototype (or straw man if you will) in place so that as a community we can openly discuss the trade-offs, and critically find the right abstraction level. (Is it a Store? a Backend? an Array? Some new layer or mechanism that doesn't exist yet? ... Personally I agree with @jbms that we're going to want a metadata representation for cross-language handling, etc.)

Kudos to @jstriebel for kicking this off and very glad to see people jumping in both here and on #876. Please do consider this the start of a conversation though. Happy to hear more!

~Josh

@rabernat
Copy link
Contributor

It is also a shame that you don't find the adoption of blosc2/caterva appealing because IMO it currently fulfills most of the requirements above

This is absolutely not true Francesc. Speaking as a Zarr core dev, my preferred solution would be to implement sharding with Blosc2. (Not sure that Caterva is needed.) I have been consistent about this position for a while (see #713). I explicitly requested we include evaluation of Blosc2 / Caterva as part of this work.

@FrancescAlted
Copy link

This is absolutely not true Francesc. Speaking as a Zarr core dev, my preferred solution would be to implement sharding with Blosc2. (Not sure that Caterva is needed.) I have been consistent about this position for a while (see #713). I explicitly requested we include evaluation of Blosc2 / Caterva as part of this work.

Ok, happy to hear that. It is just that after having a cursory look at the proposal, it seemed to me that "we currently favor a combination of approach 1 and 3" was leaving Blosc2/Caterva a little bit out of the game. But seriously, I don't want to mess up with your decisions; feel free to take any reasonable approach you find better for your requirements.

@jbms
Copy link

jbms commented Nov 19, 2021

If blosc2/caterva were used, is there a way to make it compatible with the existing numcodecs, so that for example you could use imagecodecs_jpeg2k to encode individual chunks within the shard?

As far as using caterva as the shard format, it sounds like there would then be 3 levels of chunking: shard, chunk, block. But writing would have to happen at the granularity of shards, and reading could happen at the granularity of blocks, so it is not clear to me what purpose the intermediate "chunk" serves.

Another question regarding blosc2/caterva: what sort of I/O interface does it provide for the encoded data? In general caterva would need to be able to request particular byte ranges from the underlying store ---- if you have to supply it with the entire shard then that defeats the purpose.

@FrancescAlted
Copy link

It has been a long week and you are asking a lot of questions, but let me try to do my best in answering (at the risk of providing some nebulous thoughts).

If blosc2/caterva were used, is there a way to make it compatible with the existing numcodecs, so that for example you could use imagecodecs_jpeg2k to encode individual chunks within the shard?

Sorry, but I don't have experience with numcodecs (in fact, I was not aware that jpeg2k was among the supported codecs). I can only say that Blosc2 supports the concept of plugins for codecs and filters; you can have a better sense of how this work at my recent talk at PyData Global (recording here) and our blog post on this topic. I suppose jpeg2k could be integrated as a plugin for Blosc2, but whether or not numcodecs would be able to retrieve existing imagecodecs_jpeg2k is beyond my knowledge.

As far as using caterva as the shard format, it sounds like there would then be 3 levels of chunking: shard, chunk, block. But writing would have to happen at the granularity of shards, and reading could happen at the granularity of blocks, so it is not clear to me what purpose the intermediate "chunk" serves.

That's a valid point. Actually, Blosc2 is already using chunks as you are planning to use shards, whereas our blocks is what you are intending to use as chunks, so currently Blosc2 only has two levels of chunking. I was thinking more along the lines that Zarr could implement shards as Blosc2 frames (see my talk for a more visual description of what a frame is). Indeed you would end with 3 partition levels, but perhaps this could be interesting in some situations (e.g. for getting good slice performance in different dimensions simultaneously even inside single shards).

Another question regarding blosc2/caterva: what sort of I/O interface does it provide for the encoded data? In general caterva would need to be able to request particular byte ranges from the underlying store ---- if you have to supply it with the entire shard then that defeats the purpose.

Blosc2 support plugins for I/O too. Admittedly, this is still a bit preliminary and not as well documented as the plugins for filters and codecs but still, there is a test that may shed some light on how to use them: https://github.com/Blosc/c-blosc2/blob/main/tests/test_udio.c. You may want to experiment with this machinery, but as we did not have resources enough for finishing this properly there might be some loose ends here (patches are welcome indeed).

@jstriebel
Copy link
Member Author

Thanks everyone for lively discussion! I'd like to emphasize what Josh wrote, please do consider this the start of a conversation. We at scalable minds are interested in implementing sharding support in zarr, not in a specific strategy. The initial description simply is our own and surely incomplete assessment so far.

I'll try to answer the different threads about the general approach in this issue:


@d-v-b

Can you give (or direct me to) a high-level description of how the scalable minds sharded format differs from the neuroglancer precomputed format?

So far, we mostly use webknossos-wrap (wkw), which uses compressible chunks (called blocks in wkw). As you said, a (cartesian) cube of chunks is grouped into one file, where the chunks are concatenated in Morton-order. Each file starts with some metadata: General information such as the chunk- and shard-sizes (similar to .zarray), and a jumptable to the chunk-wise start-positions iff chunks are compressed. Our access patterns mostly are shard-wise writes and chunk-wise reads.


@shoyer Thanks for the hint about zarr-developers/zarr-specs#115, this definitely looks interesting. As I understand it this would be an orthogonal concept, right? The main benefit in sharding as proposed here is that it leverages the underlying grid (e.g. extending a dataset somewhere only touches the surface-shards). One could still design the hashing-function to effectively be cube-based sharding, but I think that this rather "hacks" the idea of hashing and might be more useful as a separate concept. Also the order of chunks within a shard/hash-location can be optimized based on the geometrical grid when using sharding, for hashing this again would be rather hacky. Does this make sense?


@FrancescAlted @rabernat blosc & caterva
As far as I understand blosc2, it does itself provide chunked access within a frame, which represents one file. Those chunks can be compressed and filtered. I see two possible concepts, how blosc2 could be integrated:

  • a As another compression, which is highly configurable. There might be optimized access for subparts of a file, similar to the current PartialReadBuffer, but this does not replace sharding itself, which is handled on another level, similar to caterva. It might also be possible to only use a blosc2-chunk instead of a frame as a zarr chunk, which seems like the correct abstraction level for a compression.
  • b As a major part of zarr, where the compression and filters of blosc2 are used instead of the current compression and filter mechanisms, and blosc2-chunks would serve as chunks in a shard. The current array representation and access must completely be changed.

To me one major appeal of zarr is the separation of storage, array access (e.g. slicing) and compression. Using blosc2 as a sharding strategy would directly couple array access and compression, since slicing must know about the details of blosc2 and use optimized read and write routines. This already is the case for blosc with the PartialReadBuffer and would need to go much further, which I think degrades readability and extensibility of the code. Also, it allows sharding only with blosc2 and compatible compressions & filters. Any custom compressions/filters would first need to be integrated with blosc2 before they could be sharded, which directly couples storage considerations (sharding) with the compression. Implementing sharding within zarr itself would allow that those parts stay decoupled as they are and seems a quite doable addition.

Based on those arguments, I personally would rather tend to integrate blosc2 as another compression (a). This wouldn't be part of this issue here, but definitely a useful addition.

Regarding caterva, I understand that it handles chunking (caterva blocks) and sharding (caterva chunks), but stores this information in a binary header, and is coupled to blosc2:
caterva   blosc

To me this seems rather like an alternative to zarr, but surely could also be used as another compression (which seemed to be the initial strategy in #713). As already mentioned by @jbms, both caterva chunks and blosc2 frames both allow chunk-wise reads in a sub-part of a file and serve a similar use-case. This seems doubled, but helps to separate the APIs, where caterva cares about data access and blosc2 about compression. This is similar to the question in a, if a zarr-chunk should correspond to a blosc2-frame or a blosc2-chunk.

As you can see, we definitely do consider using blosc2 & caterva for sharding. Atm it simply does not seem to be the most effective way to me. I hope those arguments make sense, please correct me where I'm wrong, and I'm very happy to hear more arguments about using blosc2 or caterva for sharding.


PS: Some discussion about our initial implementation draft is also going on in the PR #876. Before investing much more time into the PR, I'd like to reach consensus about the general approach here.

@FrancescAlted
Copy link

@jstriebel Yeah, as said, it is up to you to decide whether Blosc2/Caterva can bring advantages to you or not. It is just that I was thinking that this could have been a good opportunity to use Blosc2 frames as a way to interchange data among different packages with no copies. Frames are very capable beasts (63-bit capacity, metalayers, can be single-file or multi-file, can live either on-disk or in-memory), and I was kind of envisioning them as a way to easily share (multidim) data among parties.

@shoyer
Copy link
Contributor

shoyer commented Nov 24, 2021

@shoyer Thanks for the hint about zarr-developers/zarr-specs#115, this definitely looks interesting. As I understand it this would be an orthogonal concept, right?

Correct, this is an orthogonal optimization

@d70-t
Copy link
Contributor

d70-t commented Dec 15, 2021

I'm a bit late in the discussion, but hope I may still join in. Generally, I really like the idea of sharding, and I believe that any of the possible ways to implement it will help us a lot 👍

TLDR: Based on this proposal, we'd now get chunks and shards, which in total are 2 levels. What makes the number 2 special? And do we want to go from 1 to n in stead of 1 to 2?

One thing I am wondering about sharding and sub-chunks is what really defines at which level of granularity one would like to introduce a chunk or shard size. I think there's definitely one level which is the smallest unit to which one could feasibly apply compression (that would be the chunk in this approach). This should be as small as feasible in order to efficiently access slices of the data. Storing anything smaller would significantly increase storage size, so this would be the "atomic" size, and it's a limit given by the algorithms.

Above that (mostly shard, but maybe also array and group) there are optimal size ranges which are influenced by the storage and network technologies used. Those ranges may vary over time (and in particular within timescales in which a dataset may still be used, if we think e.g. about 10 years or more for archival). A dataset might for example live on S3 or the like during its more active phase, but it (or parts of it) might be migrated to tape archive for longer term storage. I'd expect that different shard sizes might be required for the transport between

  • user <-> S3 (probably 10-100 MB ?)
  • S3 <-> Tape Archive (probably 1 - 10 GB ?)

Of course there might be other storage (spinning disks / SSDs / NVMes / Optanes) and network (ethernet / infiniband etc...) technologies involved, so it's hard to find good numbers. Thus, I'm wondering if one could go more in the direction of a Cache oblivious algorithm which respects the presence of those boundaries but doesn't rely on specific sizes.


In order to support sharding at various size levels, it might be an option to change the naming scheme of the chunks of an entire array, based on Morton or Hilbert curves or a similar approach. For example by encoding the chunk index by a Morton curve (for the entire array):

chunk index chunk index binary morton binary morton path
0.0 000.000 00 00 00 0/0/0
0.1 000.001 00 00 01 0/0/1
1.0 001.000 00 00 10 0/0/2
0.2 000.010 00 01 00 0/1/0
1.2 001.010 00 01 10 0/1/2
7.2 111.010 10 11 10 2/3/2

one could rearange the chunks in a way such that spatially consistent chunks will end up within a common "folder" (as given by the morton path). Up to now, if we think of object names, this would just be a renaming of all the chunks and wouldn't change anything performance-wise. It might improve access times for folder-based filesystems a bit due to probably better locality and caching of folder-metadata (but that's likely not a whole lot).

However, if we'd now would have a possibility of dynamically selecting the unit of transfer based on the depth of the folder structure (or the length of the object name prefix), we could arrive at a sharding system which could work across various storage and transfer technologies. E.g. I could fetch the entire 0/ from tape but store 0/0/ ... 0/3/ as individual objects on S3, which then would again be groups of four chunks each of which I could fetch individual chunks using range-requests. All of this data migration would be possible while knowing relatively little about what zarr actually is.

If the underlying storage technology requires to combine multiple chunks into a single object, the shard format as suggested in the prototype PR might be just the way to go. But other options, like generic folder-packing tools (e.g. tar for a tape archive) might work as well in some cases.

How to encode morton binary into numers / letters and where to potentially place folder-like separators would of course need do be discussed or probably be a parameter.

@jstriebel
Copy link
Member Author

@d70-t Thanks for joining in. I'm not sure if I understand your motivation correctly, which seems to be about transport (e.g. cloud provider to user, archiving on tape, etc). This issue is mostly about storage, decoupling chunk-wise reads (and possibly writes) from the number of files.

If the underlying storage technology requires to combine multiple chunks into a single object, the shard format as suggested in the prototype PR might be just the way to go.

That's exactly what this issue is about. I think that aggregating data during transport (network requests etc) is a different topic (and maybe more related to fs-spec?)

@normanrz
Copy link
Member

What makes the number 2 special?

@d70-t Your proposal sounds interesting. I think the special thing about 2 is that we have one level for the unit of reading (chunk) and one as a unit of writing (shard). On-the-fly re-sharding as part of an archival process should be easily implementable with "format 2".

I wonder if the morton code you proposed would limit the sharding to cube-shaped shards (e.g. 16-16-16) instead of cuboid-shaped shards (e.g. 16-8-1). The latter could be desirable for parallelized data conversion processes (e.g. converting a large tiff stack into zarr).

@d70-t
Copy link
Contributor

d70-t commented Dec 15, 2021

@jstriebel thanks for getting back. I think my thought maybe better described as a little bit of reframing which probably results in making it more cross-topic. But in the end, that might result in a very similar implementation you've already done.

My main though was: why should we have exactly two levels of chunking? If the data ends up being stored in a larger storage hierarchy, we might need more levels. Thus I was wondering if we can rephrase the information you've currently stored in .zarray.shards such that it would become possible to have multiple levels of shards.

I've also seen that you mention morton ordering, but only for within the shards, but in principle, external morton ordering would also allow to pack shards within larger shards, resulting in just a bigger shards which is again morton-orderd. This lead me to the idea of implementing multi-level sharding through renaming the chunk-indices in stead of defining multiple .zarray.shards and use the prefix-length(s) as a definition for multiple levels of sharding.

If chunk indices would be morton-ordered, then one could write out all the data to chunk-level (if so desired) and repack those chunks according your indexed shard format, basically by truncating the object key to some prefix length. If it should become necessary to migrate data from one place to another, one might to repack (and change the prefix length) or might want to pack recursively.

So my comment maybe only touches the required .zarrray-metadata and not the packing itself, but maybe it's still worth a thought?

@jbms
Copy link

jbms commented Dec 15, 2021

It is true that there is a benefit to be gained from locality at additional levels, both by choosing the order of chunks within a shard and by choosing the shard keys. However, the added benefit over the two levels, one for reading and one for writing, as @normanz said, is likely to be marginal. Also, this sharding proposal does not preclude addressing ordering in the future. Defining an order for chunks, and a key encoding for shards, may be a bit complicated: the desired order depends on the access pattern, which may be non-trivial to specify. In some cases Morton order may be desired, but in other cases not. I think it would take some work to formulate a general, flexible way to specify chunk order/key encoding.

@d70-t
Copy link
Contributor

d70-t commented Dec 15, 2021

@normanrz if we'd use only plain morton codes, yes, we would only get cube shaped shards. I'd expect that in many cases, this is what one would actually want to have (the chunks themselves could be cuboid and I'm wondering about which cases might require different aspect ratios between shards and chunks). But if differently shaped shards would be desired, it would be possible to make the bit-order of the (non-) morton code a parameter. E.g. instead of doing

x3 x2 x1 . y3 y2 y1 -> y3 x3 y2 x2 y1 x1

for the binary index transformation, one could opt for

x3 x2 x1 . y3 y2 y1 -> y3 y2 x3 y1 x2 x1

or the like. In that case, one would get a (4 x 2) sharding if the last 3 bits of chunk index would be packed.

The difficulty I see in re-sharding the current approach is that it might be non-trivial to come up with good shard shapes during archiving and loading, because people archiving the data may think differently about the datasets than users of the data. If there would be a relatively simple way of changing the shard size, that would make the implementation of storage hierarchies a lot easier. But of course, as @jbms says, coming up with a proper bit-ordering on dataset creation might as well be more difficult than specifying a sharding as proposed.

@jakirkham
Copy link
Member

As a small note, it might be worth looking at how Parquet solved this for comparison. Certainly our use case is a bit different from there's, but there are probably still things that can be learned.

Thanks @rjzamora for pointing this out to me earlier 😄

@d70-t
Copy link
Contributor

d70-t commented Feb 4, 2022

I'm trying to wrap my head around what the pros and cons of the Store and Array based prototypes are and how they might interact with other topics (like checksums, #392 and content addressable storage transformer, zarr-developers/zarr-specs#82). What I figured out so far would be:

  • stores are meant for handling lower level things like byte-layout, networking, packing / unpacking (e.g. like zip does)
  • arrays are meant for translation between indices and storage keys
  • sharding is somewhere in between: key-translation (as in __keys_to_shard_groups__ in both prototypes) as well as grouping and packing of chunks

Within sharding, the two tasks (key-translation and packing) seem to be relatively independent and there may be reasons to swap them out independently of each other: depending on some more high-level thoughts, different groupings may be preferred, but depending on the underlying storage, different packing strategies may be preferred.

Implementing sharding in the array may have additional benefits as loads and especially stores should be scheduled such that chunks which end up in one shard would be acted on in one bulk operation, so the grouping must be known to higher level functions.
Implementing sharding in the store may have additional benefits as swapping out different byte layout mechanisms would be easier and stores would still be able to observe individual chunks. In particular if some form of checksums based on merkle trees (#392) are to be added at some point, those checksums may have to be computed on chunk level and may need to be included into whatever sharding layout will be there (see this comment for some more thoughts on that).

I'm wondering if this tension could be solved if sharding would be spread between array and store: if we keep the translation in the array and put the packing into a store (translator), we may be able to have the best of two worlds?

Let's say we have shards=(2,3), the array would translate:

chunk_key -> shard_key/part_key
0.0 -> 0.0/0.0
1.2 -> 0.0/1.2
2.1 -> 1.0/0.1
3.3 -> 1.1/1.0
...

and use the translated keys for passing chunks down to the store. An existing store would be able to operate on those keys directly. That would of course be without sharding, but may be beneficial on some filesystems (see / as separator). But it would also be possible to plug in any kind of store translator, which would just pack up everything behind the / into a shard and send this shard further down to an existing store (using only the part before the / as a key).

@jstriebel
Copy link
Member Author

@d70-t At the moment the difference between the implementations should would only have a small effect for other methods. Atm both implementations (#876 #947) implement sharding as a translation layer between the actual store and the array. One time the implementation is separated into a "Translation Store", the other time it is moved directly into the Array class. In both scenarios one can iterate over chunks or shards, depending on the actual use-case. I tried to make the distinction a bit clearer in those charts:

Current Flow without Sharding

Zarr_ Current Flow

Sharding via a Translation Store

Sharding I_ Shard Translation Store

Sharding in the Array

Sharding II_ Sharding in Array

Please note that in both variations, the Array class must allow to bundle chunk-requests (read/write) and either pass them on the the Translation Store or do some logic on the bundle itself. The translation store does not change any of the properties of the underlying store, except that "chunks" in the underlying store are actually "shards", which are translated back and forth when accessing them via the array. So any other logic, e.g. for check-sums, can either be implemented to work on fine-grained chunks (using chunks from the translation store) or the shards (using chunks from the underlying store).

Also, there's another prototype PR in the zarrita repo alimanfoo/zarrita#40, using the translation store strategy (leaving out the bundling atm). This might be helpful to compare, since I tried to keep the changes as simple as possible.

@d70-t
Copy link
Contributor

d70-t commented Feb 9, 2022

Thanks @jstriebel for these awesome drawings!

So the issue with checksums is, I believe it's not possible to do partial checks. Thus, if someone would try to do a partial_getitem(some_offset) towards a checksum-verifying store, the checksum-verifying store would have to fetch the entire item in order to do the verification, which would largely defeat the purpose of sharding. Therefore, I don't think it will be possible to implement a useful checksum checking layer between Array and Store in proposal II (except if that layer does it's own sub-chunking or assumes that partial_getitem will only be called on specific size increments, both don't seem to be good choices). Implementing checksumming in front or within Array doesn't look right.

Checksumming on proposal I may work as a layer between Array and ShardTranslationStore (as it would on the current implementation between Array and Store).

The catch here might be, that it may be useful to apply sharding to checksums as well, because at some point (many chunks) it will be better to have a tree of checksums in stead of just a list. One option would be to have a separate checksum key per shard key (which should also work with proposal I). Another option would be to be able to pack the per shard checksums within the shard.

@jstriebel
Copy link
Member Author

@d70-t Is my understanding correct to do the checksumming chunk (or shard) wise? And independently of the underlying store? That's at least my understanding of . If checksumming should happen on shard- or chunk-level could be transparently configured if there is a translation layer abstraction between the sharding, e.g.

  • Checksumming per chunk: Array --> Checksumming --> Sharding --> Store
  • Checksumming per shard: Array --> Sharding --> Checksumming --> Store

Not sure if this thought is useful. Further discussion should probably be part of zarr-developers/zarr-specs#82.

The catch here might be, that it may be useful to apply sharding to checksums as well, because at some point (many chunks) it will be better to have a tree of checksums in stead of just a list.

I guess that's important if checksumming should happen per shard. If I understand the proposal in zarr-developers/zarr-specs#82 correctly, the index into the hashed data is still stored under the original key. Using the approach Array --> Sharding --> Checksumming --> Store, those indices would be combined per shard. So under the shard-key there would be all index-jsons for the current shard, pointing to the hashed locations with the actual value.

@d70-t
Copy link
Contributor

d70-t commented Feb 10, 2022

Yes, checksums could in theory be implemented either on shard or on chunk level and independent of the store. But the choice has practical implications (independent of concrete checksumming or content addressable implementations):

If checksums are computed per shard (Array --> Sharding --> Checksumming --> Store) this defeats the purpose of sharding (or at least the possibility of less-than-shard-size reads as the checksumming layer would read the entire shard anyways).

If checksums are computed per chunk (Array --> Checksumming --> Sharding --> Store) chunk-sized reads are possible, but there must be an interface between Array and Sharding to put Checksumming into.

somewhat optional:
If Checksumming turns out to be in between Array and Sharding, the checksumming part may benefit from knowing which chunks will end up in which shards, because that information could be used to apply the same grouping to checksum lists as it is applied to the chunks. That's why there would be a benefit of having keys like shard_key/part_key in stead of chunk_key flowing across the Checksumming layer. This would also allow to create per-shard checksum lists and put them into a key like shard_key/.zchecksum or the like.


Sure, further details should go into either #392 or zarr-developers/zarr-specs#82.

@joshmoore
Copy link
Member

Although the discussion has moved for the moment to the storage transformers, there are perhaps interesting use cases in https://forum.image.sc/t/ome-zarr-chunking-questions/66794 regarding the ability to stream data data directly from sensors into a single file that might be worth considering.

@jstriebel
Copy link
Member Author

jstriebel commented Aug 22, 2022

Please note that a sharding proposal is now formalized as a Zarr Enhancement Proposal, ZEP 2. The basic sharding concept and the binary representation is the same as proposed in #876, however it is formalized as a storage transformer for Zarr v3.

Further feedback on this is very welcome! Either

I'll leave this PR open for now since it contains ideas that are not covered by ZEP 2, but please consider commenting directly on the spec or implementation if it makes sense.

@jstriebel
Copy link
Member Author

jstriebel commented Nov 19, 2023

ZEP2 proposing sharding as a codec is approved 🎉

Thank you all for your thoughtful contributions!

An efficient implementation of it is being discussed as part of #1569. Closing this issue in favor of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants