car sharding #1
Comments
Yes, CAR sharding should be there! I'm just wondering whether it should be part of the ipldstore or separate. We could use CAR sharding to pack small chunks into larger objects for an object store (e.g. S3), and we could also pack several CARs into an even larger CAR to send them off to some "high(er) latency storage" like tape or Filecoin. I'd expect this to work particularly well if the CARs contain the groups of chunks which one would normally use for zarr sharding. Depending on how the zarr keys are built (Morton code?), this may or may not correspond to the carbites Treewalk method. CARv2 (== CAR + index) would probably even make a nice structure for zarr shards.
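To make the "CAR + index" idea a bit more concrete, here is a minimal sketch of packing small chunks into one larger object together with an offset index. It is not the CAR or CARv2 byte format, and `pack_chunks`, `read_chunk` and the example keys are hypothetical names used only for illustration.

```python
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # key -> (offset, length) inside the packed blob


def pack_chunks(chunks: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate chunks into one blob and record where each one lives.

    Hypothetical helper: a real CAR file has its own header and framing,
    this only illustrates the "payload + index" layout.
    """
    index: Index = {}
    parts = []
    offset = 0
    for key in sorted(chunks):  # deterministic order
        data = chunks[key]
        index[key] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    return b"".join(parts), index


def read_chunk(blob: bytes, index: Index, key: str) -> bytes:
    """Fetch a single chunk from the packed blob using the index."""
    offset, length = index[key]
    return blob[offset:offset + length]


# Example with made-up zarr-like keys.
chunks = {"temp/0.0": b"aaaa", "temp/0.1": b"bb"}
blob, index = pack_chunks(chunks)
```

With an index like this, a reader on an object store would only need a ranged read of `length` bytes at `offset` to fetch one chunk, which is roughly what one would want from a zarr shard.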
Yes, I do not know if it should be part of the ipldstore or something separate. I am thinking about the use case of building a large zarr dataset on a cluster.
CARv2 does look quite nice and helpful!
Distributed writes should definitely be possible. My guess would be that we want to have one ipldstore per worker / thread etc... As a result, every worker would only see what has been written by itself (or what would have been preloaded into the ipldstore before). I believe that this should be ok for most use cases, but I don't know for sure yet. The resulting IPLD objects would be dumped somewhere by whatever transport method suits best. An option would be to generate CARs locally and send them off. The roots (i.e. what
Another question may be whether some reorganization of the "big" CARs would be needed, in case the set of IPLD objects created by one worker doesn't match the sharding one would like to have on read. But maybe that should be regarded as an implementation detail of the system which ends up hosting the CARs / IPLD objects.
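As a rough sketch of the "one store per worker" idea: because blocks are content-addressed, the per-worker stores can be merged afterwards without conflicts. `WorkerStore` and `merge_stores` are hypothetical names, and a SHA-256 hex digest stands in for a real CID here; this is not the ipldstore API.

```python
import hashlib
from typing import Dict, Iterable


class WorkerStore:
    """Hypothetical per-worker block store: it only sees its own writes."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption: a SHA-256 hex digest is used as a stand-in for a CID.
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key


def merge_stores(stores: Iterable[WorkerStore]) -> Dict[str, bytes]:
    """Union of all worker stores; identical content maps to the same key."""
    merged: Dict[str, bytes] = {}
    for store in stores:
        merged.update(store.blocks)
    return merged


# Two workers write independently; one block happens to be identical.
w1, w2 = WorkerStore(), WorkerStore()
k_a = w1.put(b"chunk A")
k_shared_1 = w1.put(b"fill value chunk")
k_shared_2 = w2.put(b"fill value chunk")
k_b = w2.put(b"chunk B")
merged = merge_stores([w1, w2])
```

Each worker could dump its own `blocks` as a local CAR and send it off; how the hosting system regroups them for reading is a separate question, as noted above.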
@d70-t awesome work!!
What do you think about CAR sharding? That is, for a large dataset, the arrays in the dataset are split into separate CAR files. Large arrays may themselves be split into multiple CAR files, as carbites does.
The motivation is to support upload to tools like web3.storage and estuary.tech for datasets larger than 32 GB.
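A minimal sketch of what size-bounded sharding could look like: blocks are grouped greedily, in order, so that no shard exceeds a byte limit. `shard_blocks` is a hypothetical helper, the string keys stand in for CIDs, and the size accounting ignores CAR header and per-block framing overhead. It also ignores the DAG structure, so it is not a replacement for what carbites does.

```python
from typing import Iterable, Iterator, List, Tuple

Block = Tuple[str, bytes]  # (cid-like key, block data)


def shard_blocks(blocks: Iterable[Block], max_bytes: int) -> Iterator[List[Block]]:
    """Greedily group blocks into shards whose payload stays within max_bytes."""
    shard: List[Block] = []
    size = 0
    for cid, data in blocks:
        if len(data) > max_bytes:
            raise ValueError(f"block {cid} is larger than the shard limit")
        # Start a new shard when the next block would not fit anymore.
        if shard and size + len(data) > max_bytes:
            yield shard
            shard, size = [], 0
        shard.append((cid, data))
        size += len(data)
    if shard:
        yield shard


# Tiny example: three 4-byte blocks with an 8-byte limit.
blocks = [("a", b"x" * 4), ("b", b"x" * 4), ("c", b"x" * 4)]
shards = list(shard_blocks(blocks, max_bytes=8))
shard_keys = [[cid for cid, _ in shard] for shard in shards]
```

In practice `max_bytes` would be chosen somewhat below the upload limit of the target service, to leave room for the overhead this sketch does not count.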