Optimal I/O and on-disk storage for zarr arrays (chunking scheme + Zarr store) #54
See also discussion of this topic in the OME-NGFF paper:
May also want to start the longer-term discussion about object storage, as OME-NGFF is specifically mentioned as something people would choose when object-storage is required.
I processed the 23-9x8-wells dataset with the
Concerning disk space and number of files, some approximate values are
(note that these numbers are lower estimates, as I had already removed a little part of that folder... but at least they are useful as reference values) Looking at the chunk structure (with a total number of
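For reference, a small sketch of how disk-space and file-count numbers like these can be gathered on a nested OME-Zarr folder (Python; the path is a placeholder):

```python
import os

def zarr_folder_stats(path):
    """Walk an OME-Zarr folder and count files and total bytes on disk."""
    n_files, n_bytes = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            n_files += 1
            n_bytes += os.path.getsize(os.path.join(root, name))
    return n_files, n_bytes

n_files, n_bytes = zarr_folder_stats("myplate.zarr")  # placeholder path
print(f"{n_files} files, {n_bytes / 1e9:.2f} GB")
```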
This is a good insight, @tcompa! Since OME-Zarr is currently not a single file but a nested folder structure, we need to find a sweet spot between structure depth and chunk size while trying to keep data visualization as smooth and responsive as possible, which makes it a tricky business. Being able to use OME-Zarr with object storage could be an interesting way forward, as we would then worry much less about the depth of the nested structure inside the object, at least from the filesystem point of view. It would be great to contact some of the developers and see what the direction is there, as this may save us from having to keep tweaking parameters on our side.
For now (fractal-analytics-platform/fractal-client@71bf905), I've just modified the hard-coded values, making them 2560x2160. Let's see how things change. The discussion about object storage is definitely interesting, but I guess it won't lead to immediate updates, so we should find a decent compromise for our day-to-day tests.
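For illustration, a hedged sketch of what writing a pyramid level with 2560x2160 chunks can look like with dask + zarr (the (c, z, y, x) axis order and the paths are assumptions, not Fractal's actual code):

```python
import dask.array as da

# Hypothetical example: rechunk an existing pyramid level to one chunk per
# 2560x2160 image and write it back out to a new zarr array.
arr = da.from_zarr("myplate.zarr/B/03/0/0")   # placeholder: well B/03, level 0
arr = arr.rechunk((1, 1, 2160, 2560))         # (c, z, y, x): one chunk per field of view
arr.to_zarr("myplate_rechunked.zarr/B/03/0/0", overwrite=True)
```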
That's very useful info @tcompa. Do I read the above numbers correctly that at level 4, every well consists of a single 240x252 chunk? [If so, I'm a bit confused about that number. Was this with coarsening 2 or 3? I ran coarsening 2 and would expect level 4 to be 8 × 2560 / (2^5), 9 × 2160 / (2^5) => 640 x 608.] We could also eventually consider chunking in z as well, which is currently always 1.
Also, concerning file sizes, I ran a

The 2560x2160 is probably a good default to keep file numbers in check; we should test the performance implications for visualization with that :)

And yeah, any discussion on object storage or similar approaches is a much more long-term thought :) Good if we follow up on this, while keeping things in check as well as we can in the current setup. But these numbers are generally very helpful. I will prepare a post on the image.sc forum, so we can discuss with the OME-Zarr devs how they handle this for larger datasets, where they store their chunked plates, etc.
A few comments on the huge-number-of-files issue. At the moment what we are doing is something like

I agree that copy operations are quite slow due to the number of files, and using a compression approach could be useful. For this reason I am testing a little bit the

In short: writing --> NestedDirectoryStore

A big open question, using this approach, is how much it costs, in terms of time, to change the ome-zarr plugin in order to read from the zip file. They check for file extensions, so it clearly does not work as-is, but what makes me a little more confident is that the
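A minimal sketch of the "write to a NestedDirectoryStore, then zip it at the end" flow described above (zarr v2 API; paths are placeholders):

```python
import zarr

# Write side: the nested directory store produced during processing.
src = zarr.NestedDirectoryStore("myplate.zarr")

# Pack everything into a single .zip file, copying keys/chunks as-is.
dst = zarr.ZipStore("myplate.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()  # the ZipStore must be closed to flush the archive
```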
Coarsening is 3x, so the per-level well sizes work out as in the rough check below:
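A rough sketch of what 3x coarsening implies per level, assuming a full-resolution well of 9 × 2160 by 8 × 2560 pixels (the 9x8 grid of 2160x2560 images from the numbers above):

```python
# Per-level well sizes with 3x coarsening (sizes assumed from the numbers above).
well_y, well_x = 9 * 2160, 8 * 2560          # 19440 x 20480 at level 0
for level in range(5):
    factor = 3 ** level
    print(f"level {level}: {well_y // factor} x {well_x // factor}")
# level 4 -> 240 x 252, i.e. a single chunk per well, matching the observation above
```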
@tcompa ah, makes sense! Regarding

Also, very good thought regarding the option to have "converting to zarr.zip" at the end. Chunks in zip stores aren't editable anymore according to the documentation, so we may not want to do this early during processing, but maybe once the array has been fully processed, it would make sense :)

The plugin would certainly pose a potential issue here. They just updated the reading functionality to ensure it's .zarr files only, see here: ome/napari-ome-zarr#42
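For completeness, a sketch of opening such a zipped plate read-only (zarr v2 API; the path and well layout are placeholders):

```python
import zarr

store = zarr.ZipStore("myplate.zarr.zip", mode="r")
plate = zarr.open_group(store, mode="r")
data = plate["B/03/0/0"][:]   # reading chunks works as usual
# In-place edits are not supported: entries inside an existing zip archive
# cannot be replaced, which is why zipping only at the very end makes sense.
store.close()
```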
There is a very interesting forum thread on this topic that @gusqgm found today, from the end of last year: https://forum.image.sc/t/support-for-single-file-ome-ngff/60764/16

Sounds like we're not the only ones concerned about this; the OME devs are generally aware of it, but it hasn't been a big priority on their side, and zipped zarrs may not currently work with the ome-zarr-py reader.

There is also the very interesting idea of sharding. Something like this was accepted into the zarr v3 spec in February: zarr-developers/zarr-specs#134

I'm not really sure how much of it has been fully implemented in the Python API yet; there is e.g. an issue here where a lot of discussion is happening: zarr-developers/zarr-python#877

My understanding is that Zarr v3 will still take a while until it is here. It started in 2019 and it's ongoing, so not a quick thing.

Tl;dr: We are clearly not the only ones interested in this, and there is already work ongoing for it, but I don't have a clear understanding of the timelines for when which parts will be ready. Also, the
Thanks, very interesting to know about these ongoing discussions. Let's keep an eye on them, AKA I also clicked "Subscribe" here and there...
Had a conversation with Kevin Yamauchi today about file numbers and storage. He strongly recommended checking out object storage approaches like on-premises S3 buckets.
Very interesting discussions & inputs in this image.sc thread: https://forum.image.sc/t/storing-large-ome-zarr-files-file-numbers-sharding-best-practices/70038/6 My main takeaways for Fractal:
From Jonathan Striebel:
=> If file numbers become a large issue before sharding arrives, looking into using ZIP stores in the meantime seems like a valid idea. There may even be more support for this long term (though again, see 1: I think long term, sharding will likely be the real solution).
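A back-of-the-envelope illustration of why sharding addresses the file-number concern, using the well/chunk sizes from earlier in this thread (plain arithmetic, not a real sharding API call; the shard shape is a made-up assumption):

```python
import math

array_shape = (1, 1, 19440, 20480)   # (c, z, y, x): one well at full resolution
chunk_shape = (1, 1, 2160, 2560)     # one chunk per field of view
shard_shape = (1, 1, 19440, 20480)   # hypothetical: one shard per well and level

def n_blocks(array, block):
    """Number of blocks of shape `block` needed to tile `array`."""
    return math.prod(math.ceil(a / b) for a, b in zip(array, block))

print("chunk files without sharding:", n_blocks(array_shape, chunk_shape))  # 72
print("shard files with sharding:   ", n_blocks(array_shape, shard_shape))  # 1
```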
If we want to explore sharding, let's create a more specific issue.
Let's continue sharding discussions here: #294
Opening an issue here for a discussion topic that came up today: What is the optimal chunking we should have for our files.
Thanks to @tcompa's work on improved chunking, we now have the ability to rechunk zarr files and to save them with the desired chunk sizes at all pyramid levels; see here: fractal-analytics-platform/fractal-client#32
What is the optimal size we should chunk files as?
There are two main concerns that may go in sync for a while, but may also become a trade-off.
Concern 1)
Interactive accessibility of the data.
Specifically, how nicely can we read in just the chunk of data we are interested in and what kind of access patterns will we have (e.g. at the moment, we are loosely optimizing for displaying xy layers of the data, but not to display xz layers). Also, what is the optimal chunk size for visualization? Increasing chunk sizes for lower-res pyramid structures helped a lot in interactive performance. Would further increases still help or start to hinder performance?
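To make the access-pattern point concrete, a small sketch with dask (the shape, axis order, and chunk sizes are assumptions based on the numbers in this thread):

```python
import dask.array as da

# Dummy array standing in for one well: (c, z, y, x) with one chunk per xy image.
arr = da.zeros((1, 19, 19440, 20480), chunks=(1, 1, 2160, 2560))

xy_plane = arr[0, 10]        # one z-plane: touches 9 x 8 = 72 chunks
xz_plane = arr[0, :, 5000]   # one xz-plane: touches 19 z-chunks x 8 x-chunks = 152 chunks
print(xy_plane.npartitions, xz_plane.npartitions)
```

With this chunking, an xz view touches roughly twice as many chunks as an xy view, and each touched chunk contributes only a single row of pixels, so most of the decompressed data is discarded.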
Concern 2)
Number of folders and files saved to the filesystem
I'm no filesystem expert, but we've already started noticing that copying zarr files is quite slow because it's a very nested structure. Also, IT at the FMI became worried about file numbers today when we presented our plans for Fractal and OME-Zarr usage, with regard to the number of files that would need to be saved. Their preference would be that we have equal or fewer files than the raw data.
Both concerns are things we should explore further to find good tradeoffs and exchange with e.g. the OME-Zarr developers about how they handle this on their side.
Also, Zarr has support for reading from zip stores directly (=> it would just be 1 file on disk, right?), see here: https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives
Additionally, there are cloud storage approaches for Zarrs for AWS, Google Cloud Storage etc. that may be worth having a look at.
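As a hedged sketch of what that could look like with zarr over s3fs (bucket name and paths are placeholders; an on-premises, S3-compatible object store would just use a custom endpoint_url):

```python
import s3fs
import zarr

# For on-prem S3, point client_kwargs={"endpoint_url": ...} at the local service.
fs = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root="my-bucket/myplate.zarr", s3=fs, check=False)
plate = zarr.open_group(store, mode="r")
```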