Add Sharding v1.0 Spec & Storage Transformers to Zarr v3.0 #134
Conversation
One thing that I think would really help address this question is if we consider what other storage transformers might be useful, and how they would stack before and after sharding. Two ideas I've heard mentioned are:
1. content-addressable storage
2. checksums for verifying data integrity
It seems to me that 1 (content-addressable storage) could be stacked after sharding, though since it would deduplicate entire shards it would not be particularly useful (and to actually verify the checksum of partial reads would require a more complicated tree-of-checksums approach). It is less clear how it would work to stack it before sharding; probably the sharding transformer would need to be modified to just pass through the hash keys (so the shards would just store the keys of each chunk, not the chunk itself), but then it would fail to accomplish the original objective of sharding, which is to reduce the number of keys in the underlying store. It seems that 2 (checksums) could be stacked before or after sharding, though if stacked after sharding it would need a more elaborate tree-of-checksums approach in order to allow verifying the checksums of partial reads. If stacked before sharding, it could also be implemented as a codec. The other point, not directly relevant to the spec, is that it would be useful to expose, in a convenient way on the array class itself, something that indicates the optimal chunk shape for writing (i.e. the shape of each shard); previously users would just use the chunk shape for this purpose, but with sharding a separate attribute is needed.
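To make the stacking question a bit more concrete, here is a minimal, hypothetical sketch (not the API from this spec) of a checksum transformer that wraps another store-like layer; the wrapping order then determines whether individual chunks or whole shards are checksummed:

```python
import hashlib

class ChecksumTransformer:
    """Illustrative only: wraps another store-like layer (a transformer or the
    base store) and verifies a per-value SHA-256 checksum on read. Class and
    method names are hypothetical, not taken from the spec in this PR."""

    def __init__(self, inner):
        self.inner = inner  # the next layer in the stack

    def set(self, key, value):
        digest = hashlib.sha256(value).hexdigest().encode()
        # Prepend the hex digest; the first newline separates digest and payload.
        self.inner.set(key, digest + b"\n" + value)

    def get(self, key):
        digest, value = self.inner.get(key).split(b"\n", 1)
        if hashlib.sha256(value).hexdigest().encode() != digest:
            raise ValueError(f"checksum mismatch for key {key!r}")
        return value

# Stacking order (Array -> transformers -> Store) decides what is checksummed:
#   ChecksumTransformer(ShardingTransformer(store))  # "before sharding":
#       per-chunk checksums; could equally be implemented as a codec.
#   ShardingTransformer(ChecksumTransformer(store))  # "after sharding":
#       whole-shard checksums; verifying partial reads would then need a
#       tree-of-checksums approach instead.
```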
Agreed! That's very important to assess whether storage transformers are a helpful concept here. We might not need to "fix" compatibility of sharding with possible future storage transformers yet, but it's very useful to explore. My understanding of the CAS proposal is that the hash keys are stored at the original data location, and that the value entries are stored under a different prefix. Besides the two possible future proposals you mentioned, I found two storage transformers already in the current v2 zarr-python implementation, which are implemented as stores wrapping another store (basically the same thing as a storage transformer, without the formalizations):
Also, compatibility layers might be implemented at the storage transformer level, to be able to transform metadata directly. What I already quite like about the storage transformer stack concept is that it makes it easy to wrap your head around and communicate the exact behavior (e.g. what X before Y versus Y before X in the stack means), since it brings a clear interface and abstraction. However, I think it's also perfectly fine to "just" introduce sharding as a core mechanism, without the additional transformer concept.
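And a similarly rough sketch of the content-addressable layout described above: the hash stored at the original data location, the value under a separate prefix. The `cas/` prefix, class name and methods are made up for illustration and are not part of this proposal:

```python
import hashlib

class ContentAddressableTransformer:
    """Illustrative only: stores each value once under a hash-derived key and
    writes just the hash at the original location, so identical values are
    deduplicated in the underlying layer."""

    def __init__(self, inner):
        self.inner = inner  # the next transformer in the stack, or the base store

    def set(self, key, value):
        digest = hashlib.sha256(value).hexdigest()
        self.inner.set(f"cas/{digest}", value)   # value entries under a different prefix
        self.inner.set(key, digest.encode())     # hash key at the original data location

    def get(self, key):
        digest = self.inner.get(key).decode()
        return self.inner.get(f"cas/{digest}")
```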
Hi @jstriebel, great to see this and very happy to review. One small thought, in approaching this it might be useful to distinguish between storage transformers, which do actually make some change to how the data is organised within the base store, and other types of storage layer such as caching, which are not changing how data are organised but offer some other advantage such as performance. The obvious reason for this distinction is that if a true storage transformer has been used, then a zarr implementation that's trying to read the data needs to know about it, and so it needs to be included in the zarr metadata so the implementation can discover it. That is equally true regardless of which zarr implementation is being used, and what type of base storage is being used. Other storage layers such as caches are configured at runtime depending on the application and/or the zarr implementation and/or the base store type, and so should not be included in zarr metadata.
I wonder about the efficiency of the indexed shard format. Depending on shard/chunk size, reading the index could potentially be moderately expensive. In the worst case (without caching), there would be twice the number of IO operations required to read a chunk. The alternative would be some form of higher level index, e.g., at the level of an entire array. For moderate numbers of chunks (e.g., <1e7) this might be desirable. Do we have a sense of what shard/chunk sizes would be desired in practice? That might help inform these design decisions.
The desired shard size probably depends on a lot of factors, such as relative cost/overhead per key/value pair on the underlying store, the throughput of the underlying store, and the desired write pattern. But I think for GCS or similar cloud storage I would probably target 1GB shards normally, perhaps containing several thousand chunks. Possibly for a very large array and a fast network one could go larger, ~10GB, but then the shard index starts to be rather large. A single shard index for the entire array means it is challenging to write the array in parallel without additional coordination, which I think is a major drawback. Additionally, it seems like in the uncached case it is actually worse than a separate index per shard -- you would still be reading the entire array-level index. However, it is true that if you first read a complete index of all chunks and cache it, then it would be more efficient to be able to read that index with a single read rather than one read per shard.
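For a rough feel of the numbers involved, and assuming (this is an assumption; the authoritative layout is in the spec diff) a binary per-shard index of one (offset, length) uint64 pair per chunk stored at the end of the shard:

```python
import numpy as np

# Back-of-the-envelope index size: 16 bytes (two uint64) per chunk entry.
chunks_per_shard = 10_000
index_nbytes = chunks_per_shard * 2 * 8   # = 160_000 bytes, ~160 kB per shard

# Without caching, reading one chunk costs two range requests: one for the
# index, one for the chunk itself. `read_range` is a hypothetical store method.
def read_chunk(store, shard_key, shard_nbytes, chunk_id, chunks_per_shard):
    index_nbytes = chunks_per_shard * 2 * 8
    raw = store.read_range(shard_key, shard_nbytes - index_nbytes, index_nbytes)
    index = np.frombuffer(raw, dtype="<u8").reshape(chunks_per_shard, 2)
    offset, length = index[chunk_id]
    return store.read_range(shard_key, int(offset), int(length))
```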
This sounds about right to me.
I agree, I was imagining this as more similar to consolidated metadata -- something that is written after the fact. In any case, I guess this is something that could be layered on later, if desired.
There is an interesting connection here to what happens in kerchunk: you can already create a virtual dataset over many zarr datasets, where each of those may live in a different bucket or even a different backing store. This would correspond to the idea earlier in this thread of moving sharding into the storage layer, but also comes with the benefit of consolidating the metadata. I'm not saying it's a replacement, but perhaps a complementary way of achieving something similar, one that already works in V2.
@alimanfoo Good point, absolutely agree! I think runtime-configured storage-transformers could mostly be appended or prepended to the stack. Do you already have further feedback regarding the specs? I'm on vacation for a week, but happy to apply any feedback afterwards.
@martindurant Agreed! I think there's value in having this in the zarr standard, but interesting to see how this would be done in kerchunk. This would require configuring each shard in kerchunk, right?
Each chunk of each shard is referenced by the collected reference set. So kerchunk scans the inputs just once and collates this information. "Configuring" sounds more heavy than it is -- I'm not sure what you mean; it's pretty simple.
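For readers unfamiliar with kerchunk: a reference set is essentially a mapping from each chunk key to a (URL, offset, length) triple pointing into whatever object holds that chunk. The bucket/file names and byte ranges below are invented; the real format is documented in the kerchunk/fsspec projects:

```python
# Hypothetical kerchunk-style reference set. Metadata is stored inline as JSON
# strings, while each chunk key points at a byte range in some underlying object
# (possibly a different bucket or backing store per chunk).
references = {
    "version": 1,
    "refs": {
        ".zarray": '{"zarr_format": 2, "shape": [4, 4], "chunks": [2, 2], '
                   '"dtype": "<f8", "compressor": null, "fill_value": null, '
                   '"filters": null, "order": "C"}',
        "0.0": ["s3://bucket-a/shard-000.bin", 0, 32768],
        "0.1": ["s3://bucket-a/shard-000.bin", 32768, 32768],
        "1.0": ["s3://bucket-b/shard-001.bin", 0, 32768],
        "1.1": ["s3://bucket-b/shard-001.bin", 32768, 32768],
    },
}
# Such a reference set can then be opened through fsspec's reference filesystem
# and read like a normal zarr store.
```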
@alimanfoo It would be great to get your review on this proposal, could you maybe give an estimate of when you have time for it?
@jstriebel - thanks a lot for all of your hard work and patience. 🙏 This PR has revealed that our governance is still lacking in terms of ability to make major decisions in a timely way. I recognize that this must be frustrating for a contributor such as yourself who has put a lot of work into Zarr. The situation reflects the fact that we are still in transition away from a BDFL governance model towards a more community-based one. I am very optimistic that the ZEP process (currently being refined in zarr-developers/governance#16) will give us a good framework to decide on this PR. I hope we will conclude the ZEP discussion and accept it by the end of next week. After that, we can reframe this PR as a ZEP, with your top-level comment basically serving as the text of the ZEP. If my suggestion in zarr-developers/governance#16 (review) is accepted, approval would require
I will personally commit to doing whatever I can over the next weeks to ensure that this process evolves in a timely way. Because this work touches the spec so deeply, you have inadvertently become a 🐹 (guinea pig, not hamster!) in our nascent spec governance process. I know that was probably not what you signed up for. But I am very optimistic that this effort will bootstrap a much more efficient and sustainable framework for evolving the Zarr spec. So thank you again!
A brief update from the rousing community call yesterday evening (my time): it seems that the tendency is towards simplifying the V3 spec, getting it out, and then layering this proposal on top in the least breaking way possible.
As discussed during recent community meetings and the steering council, this proposal has been merged into the dev branch as a common basis for discussions. The final list of features to be included in v3.0 is still to be decided.
This PR adds a specification for the sharding protocol, as discussed in issue #127 and explored in Python implementations previously¹. From those discussions the presented sharding design emerged as a reasonable candidate we hope to include with zarr v3. We (scalable minds) will also provide a PR based on this spec for sharding support in zarr-python v3 after those changes are incorporated into the zarr-spec. This work is funded by CZI through the EOSS program.
This PR adds the sharding & storage transformer concepts to the zarr v3 specs.
The following parts are copied from the spec in this PR:
Sharding
Motivation
Storage transformer
Sequence diagram with storage transformer
The spec contains many more details; please see the diff of this PR for a full overview.
To see the rendered spec in an auto-reloading server, I ran
We'd be very happy to get feedback on this PR, high-level as well as concrete suggestions and questions.
Two high-level questions that came up directly during the design:
1. Does it make sense to add Storage Transformers here? This concept seems to be a great point to plug in other future extensions, and quite the right abstraction for sharding. An alternative would be to incorporate sharding as one specific mechanism without adding the whole transformer concept. Implementation-wise this would behave just like one specific transformer, but the logic could also be implemented closer to the Array or Store instead.²
2. Should storage transformers become part of the core protocol, or rather be an optional extension? If they were an extension, the `storage_transformer` key for each transformer would rather be called `extension`, and the extension would need to be listed in the array metadata `extensions` key, I guess. This would be helpful if optional storage transformers might be added. One such example might be a caching storage transformer (not actually transforming things, just caching). A rough illustration of the metadata in question follows below.

@alimanfoo In the community call two weeks ago there was the suggestion that you might be a great candidate for the first main review, since there's quite some overlap with issue #82. What do you think?
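Purely as a non-normative illustration of the metadata being discussed (the exact key names and schema are defined in the spec text of this PR; the values below are placeholders), array metadata carrying a sharding storage transformer might look roughly like:

```python
# Placeholder sketch of v3 array metadata with a sharding storage transformer
# entry; names and nesting are illustrative, not the normative schema.
array_metadata = {
    "shape": [10000, 10000],
    "data_type": "<f8",
    "chunk_grid": {"type": "regular", "chunk_shape": [100, 100]},
    "storage_transformers": [
        {
            "type": "sharding",
            "configuration": {"chunks_per_shard": [10, 10]},
        }
    ],
}
```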
Thanks everyone for the helpful discussions so far! We hope to get the design finalized via this spec and look forward to getting this into zarr production 🎉
cc @joshmoore @d-v-b @jbms @jakirkham @MSanKeys963 @rabernat @shoyer @d70-t @chris-allen @mkitti @FrancescAlted @martindurant @DennisHeimbigner @constantinpape
Footnotes
1. Previous explorations regarding Sharding with the Python implementations:
   https://github.com/zarr-developers/zarr-python/issues/877: original issue for the Python implementation, containing many discussions around the sharding protocol design
   https://github.com/zarr-developers/zarr-python/issues/876: initial prototype PR
   https://github.com/zarr-developers/zarr-python/issues/947: another prototype where the translation logic happens purely in the Array class
   https://github.com/alimanfoo/zarrita/pull/40: PR in the clean-room implementation zarrita, in the spirit of Prototype I but simplified as much as possible

2. Sidenote: I think the storage transformer serves a similar purpose on the store side as filters do on the encoding side. They provide a clear interface to add store-agnostic functionality, similar to what filters do for encodings. See also the collapsed "Sequence diagram with storage transformer" section above.