
Optimal I/O and on-disk storage for zarr arrays (chunking scheme + Zarr store) #54

Closed
jluethi opened this issue May 30, 2022 · 14 comments

jluethi (Collaborator) commented May 30, 2022

Opening an issue here for a discussion topic that came up today: What is the optimal chunking we should have for our files.

Thanks to @tcompa's work on improved chunking, we now have the ability to rechunk zarr files and save them with the desired chunk sizes at all pyramid levels; see fractal-analytics-platform/fractal-client#32

What is the optimal chunk size for our files?
There are two main concerns that may align for a while, but may eventually turn into a trade-off.

Concern 1)
Interactive accessibility of the data.
Specifically, how easily can we read in just the chunk of data we are interested in, and what kind of access patterns will we have (e.g. at the moment we are loosely optimizing for displaying xy planes of the data, but not xz planes). Also, what is the optimal chunk size for visualization? Increasing chunk sizes at the lower-resolution pyramid levels helped interactive performance a lot. Would further increases still help, or start to hinder performance?
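A minimal sketch (not part of the original discussion; the array shape and chunk sizes below are purely illustrative) of why a per-plane chunk layout favours xy access over xz access. Dask reports how many chunks each kind of plane would touch, without reading any data:

import dask.array as da

# Hypothetical per-well array (channel, z, y, x) with one site-sized chunk per plane.
well = da.zeros((3, 19, 19440, 20480), chunks=(1, 1, 2160, 2560), dtype="uint16")

xy_plane = well[0, 0, :, :]  # touches 9 x 8 = 72 chunks, and every pixel of each chunk is used
xz_plane = well[0, :, 0, :]  # touches 19 x 8 = 152 chunks, but only one row of each chunk is used

print(xy_plane.npartitions, xz_plane.npartitions)  # 72 152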

Concern 2)
Number of folders and files saved to the filesystem.
I'm no filesystem expert, but we've already started noticing that copying zarr files is quite slow, because it's a very nested structure. And IT at the FMI became worried about file numbers today when we presented our plans for Fractal and OME-Zarr usage, with regard to the number of files that would need to be saved. Their preference would be that we end up with equal or fewer files than the raw data.

Both concerns are things we should explore further to find good trade-offs, and we should exchange with e.g. the OME-Zarr developers about how they handle this on their side.
Also, Zarr has support for reading from zip stores directly (=> that would just be 1 file on disk, right?), see here: https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives
Additionally, there are cloud storage approaches for Zarr on AWS, Google Cloud Storage, etc. that may be worth having a look at.
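For reference, a minimal sketch of what reading from a zip store looks like with zarr-python v2 (the file name and the internal well/image path are placeholders, not our actual layout):

import zarr

# Open a zarr hierarchy that was packed into a single zip file (read-only).
store = zarr.ZipStore("plate.zarr.zip", mode="r")   # placeholder file name
root = zarr.open_group(store, mode="r")
level0 = root["B/03/0/0"]                           # hypothetical OME-Zarr well/image path
tile = level0[0, 0, :1080, :1280]                   # chunked reads work as with a directory store
store.close()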

jluethi (Collaborator, Author) commented May 30, 2022

See also discussion of this topic in the OME-NGFF paper:

Additionally, a key parameter in overall access times is the size of individual chunks. As chunk sizes decrease, the number of individual chunk files increases rapidly (Extended Data Fig. 2). In this benchmark, we have chosen a compromise between chunk size and number of individual files. This illustrates a primary downside of NGFF formats; as the number of files increases, the time required for copying data between locations increases. Users will need to understand and balance these trade-offs when choosing between open, bioimaging file formats.

May also want to start the longer-term discussion about object storage, as OME-NGFF is specifically mentioned as something people would choose when object-storage is required.

In situations where object storage is mandated, as in large-scale public repositories, we encourage the use of OME-NGFF today.

tcompa (Collaborator) commented May 31, 2022

I processed the 23-well, 9x8-sites dataset with yokogawa_to_zarr, using the current chunk-size choice.
The structure of a single well is already quite deep:

Chunks at level 0:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1080), (1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280, 1280))
Chunks at level 1:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1080, 1080, 1080, 1080, 1080, 1080), (1366, 1366, 1366, 1366, 1362))
Chunks at level 2:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1080, 1080), (1138, 1137))
Chunks at level 3:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (720,), (758,))
Chunks at level 4:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (240,), (252,))

Concerning disk space and number of files, some approximate values are

$ du -sh /data/active/fractal/tests/Temporary_data_UZH_23_well_9x8_sites/
798G

$ find /data/active/fractal/tests/Temporary_data_UZH_23_well_9x8_sites/ -type f | wc -l
607428

(Note that these numbers are lower estimates, as I had already removed a small part of that folder, but at least they are useful as reference values.)

Looking at the chunk structure (with a total of 3*19*18*16 = 16416 chunks per well at level 0, which then gets multiplied by 23 wells and further increased by the pyramid levels) and at the number of files, I suspect that we should try to increase the chunk size even further, to 2560x2160.
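If we go that way, a rough sketch (not the actual fractal task; paths are placeholders) of rewriting a pyramid level with one-site chunks could look like this, with 2160 in y and 2560 in x so that one chunk matches one site:

import dask.array as da

# Rewrite level 0 of one well with one-site chunks; z-chunks stay at 1.
level0 = da.from_zarr("well.zarr/0")              # placeholder path to an existing pyramid level
rechunked = level0.rechunk((1, 1, 2160, 2560))    # (channel, z, y, x)
rechunked.to_zarr("well_rechunked.zarr/0", overwrite=True)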

gusqgm commented May 31, 2022

This is good insight, @tcompa !
Chunk sizes of 2560x2160 correspond to the current size of each site ( = sensor size on the microscope camera), which could be a good unit to keep.

Since OME-Zarr is currently not a single file but a nested folder structure, we need to find a sweet spot between structure depth and chunk size while keeping data visualization as smooth and responsive as possible, which makes this a tricky business.

Being able to use OME-Zarr with object storage could be an interesting way forward, as we would then not need to worry as much about the depth of the nested structure inside the object, at least from the filesystem point of view. It would be great to contact some of the developers and see what the direction is there, as this may save us from having to keep tweaking parameters on our side.

tcompa referenced this issue in fractal-analytics-platform/fractal-client May 31, 2022
tcompa (Collaborator) commented May 31, 2022

For now (fractal-analytics-platform/fractal-client@71bf905), I've just modified the hard-coded values to 2560x2160. Let's see how things change.

The discussion about object storage is definitely interesting, but I guess it won't lead to immediate updates, so we should find a decent compromise for our day-to-day tests.

jluethi (Collaborator, Author) commented May 31, 2022

That's very useful info @tcompa . Do I read the above numbers correctly that at level 4, every well consists of a single 240x252 chunk? [if so, I'm a bit confused about that number. Was this with coarsening 2 or 3? I ran coarsening 2, would expect level 4 to be 8 × 2560 / (2^5), 9 × 2160 / (2^5) => 640 x 608]

We could also eventually consider chunking in z as well, which is currently always 1.
Some pros & cons:

  • Pro: Could create larger chunks at lower-res pyramid levels
  • Pro: Could allow for different access patterns
  • Con: Less performant for our current access patterns of mostly loading slice by slice
    => Let's not change this yet, but consider that we could play with that dimension as well
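To make the first pro concrete (illustrative arithmetic only, based on the level-3 chunk sizes listed above):

# At pyramid level 3 each plane is already a single 720 x 758 chunk, so the number of
# chunk files per well at that level depends only on the channel and z chunking.
channels, z_planes = 3, 19
files_with_z_chunks_of_1 = channels * z_planes   # 57 small files per well
files_with_full_z_chunk = channels               # 3 files per well, but any xy view reads 19 planes
print(files_with_z_chunks_of_1, files_with_full_z_chunk)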

Also, concerning file sizes, I ran a du -sh on the full folder for the same dataset with 5 pyramid levels and it came back as 1.2 TB. So the folders you had to remove may have had quite a bit of content still.
And seeing that this dataset already consists of over half a million individual files makes it quite clear how we could run into some limits here.

The 2560x2160 is probably a good default to keep file numbers in check; we should test the performance implications for visualization with that :) And yeah, any discussion on object storage or similar approaches is a much more long-term thought :) Good if we follow up on this, while keeping things in check as well as we can in the current setup.

But these numbers are generally very helpful. I will prepare a post on the image.sc forum, so we can discuss with the OME-Zarr devs how they handle this for larger datasets, where they store their chunked plates, etc.

mfranzon (Collaborator) commented:

A few comments on the huge-number-of-files issue. At the moment what we are doing is, in zarr terms, something like a NestedDirectoryStore (more detail here), which is already a small improvement over using a DirectoryStore (here).

I agree that copy operations are quite slow due to the number of files, and a compression approach could be useful. For this reason I am experimenting a bit with the ZipStore (here), which seems to be a good solution for having everything in one file, but it could make the writing operations more complicated.
A good approach, in my opinion, could be to keep the writing process as it is, and then add a new task in which we do the compression, starting from the root of the zarr (so that the whole zarr hierarchy gets compressed, not only the arrays). In this way we can read from a .zarr.zip using the methods provided by the zarr library.

In short:

writing --> NestedDirectoryStore
reading --> ZipStore

A big open question with this approach is how much it would cost, in terms of time, to change the ome-zarr plugin so that it can read from the zip file. It checks for file extensions, so it clearly won't work as is; but what makes me a little more confident is that reader.py in the ome-zarr library uses an FSStore to open zarr files, which should be general enough for zip files as well. In any case, all of this should be tested.
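A rough sketch of that write-then-pack workflow with zarr-python v2 (paths and names are placeholders; this is not an implemented task):

import zarr

# 1) Write as the tasks do now, via a NestedDirectoryStore.
write_store = zarr.NestedDirectoryStore("plate.zarr")
root = zarr.group(store=write_store)
# ... tasks fill in groups and arrays here ...

# 2) Packing task: copy the whole hierarchy (metadata + arrays) into a single zip file.
zip_store = zarr.ZipStore("plate.zarr.zip", mode="w")
zarr.copy_store(write_store, zip_store)
zip_store.close()

# 3) Read-only access from the single .zarr.zip file.
read_root = zarr.open_group(zarr.ZipStore("plate.zarr.zip", mode="r"), mode="r")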

tcompa (Collaborator) commented May 31, 2022

That's very useful info @tcompa . Do I read the above numbers correctly that at level 4, every well consists of a single 240x252 chunk? [if so, I'm a bit confused about that number. Was this with coarsening 2 or 3? I ran coarsening 2, would expect level 4 to be 8 × 2560 / (2^5), 9 × 2160 / (2^5) => 640 x 608]

Coarsening is 3x, so that:

level 0: (9*2160,8*2560) = (19440, 20480)
level 4: (19440/3**4,20480/3**4) = (240,252)  # with a bit of roundoff from 252.84

jluethi (Collaborator, Author) commented May 31, 2022

@tcompa ah, makes sense!
And that's great, @mfranzon! Are those zarr writing configurations easy to switch between? We could prepare a small-to-medium dataset in a few of those configurations so we can compare them, with regard to file size, file numbers, and performance in napari :)

Regarding DirectoryStore vs NestedDirectoryStore, I'm not sure I fully understand the distinction yet; maybe something we can have a look at together at some point.

Also, very good thought regarding the option of converting to zarr.zip at the end. Chunks in zip stores aren't editable anymore according to the documentation, so we may not want to do this early during processing, but once the array has been fully processed it would make sense :)

The plugin would certainly pose a potential issue here. They just updated the reading functionality to ensure it only accepts .zarr files, see here: ome/napari-ome-zarr#42
They did include .zarr/ though, but probably not .zarr.zip/ yet, so if we go that route, we'll have to create a workaround for it on the plugin side as well (which should be doable, but probably won't work out of the box).

jluethi (Collaborator, Author) commented Jun 9, 2022

There is a very interesting forum thread on this topic that @gusqgm found today, from end of last year: https://forum.image.sc/t/support-for-single-file-ome-ngff/60764/16

Sounds like we're not the only ones concerned about this; the OME devs are generally aware of it. But it hasn't been a big priority on their side, and zipped zarrs may not currently work with the ome-zarr-py reader.

There is also this very interesting idea of sharding, which may be a good intermediate between single zip files and folders full of one file per chunk: it groups multiple chunks into a single file while maintaining chunked access to them. See here: https://forum.image.sc/t/sharding-support-in-ome-zarr/55409/6
And here for some very visual slides on the topic by the scalableminds folk (webKnossos): https://docs.google.com/presentation/d/1sPfhYRBZGLA6RI8dAjwg8Iuaolz0Xs3O404z5l1-Rx8/edit#slide=id.ge173d3ea2b_0_88
(they have amazing performance in browsing in 3D, but also check their chunking: 32x32x32, so isometric and very small chunks!)
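To get a feel for what sharding could mean for our file counts (purely illustrative arithmetic; the shard shape below is made up), grouping chunks into shard files divides the number of files by the number of chunks per shard:

# Level-0 chunk counts from the dataset above: 3 channels x 19 z-planes x 18 x 16 xy chunks.
chunks_per_well = 3 * 19 * 18 * 16            # 16416 chunk files per well
total_chunk_files = chunks_per_well * 23      # 377568 files for the plate, level 0 only

# Hypothetical shard: all 19 z-planes plus a 2 x 2 block of xy chunks per shard file.
chunks_per_shard = 1 * 19 * 2 * 2             # 76 chunks per shard
shard_files_per_well = chunks_per_well // chunks_per_shard   # 216 shard files per well

print(total_chunk_files, shard_files_per_well * 23)          # 377568 vs 4968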

Something like this has been accepted into zarr v3 spec in February: zarr-developers/zarr-specs#134

I'm not really sure how much of it has been fully implemented in the Python API yet; there is e.g. an issue here where a lot of discussion is happening: zarr-developers/zarr-python#877

My understanding is that Zarr v3 will still take a while to arrive. Work on it started in 2019 and is ongoing, so it's not a quick thing.

Tldr: We are clearly not the only ones interested in this, and there is already work ongoing, but I don't have a clear understanding of the timelines for when which parts will be ready. Also, the sharding concept seems to be what different people with this same issue are moving towards, and it may be relevant for us as well once it is available.

tcompa (Collaborator) commented Jun 10, 2022

Thanks, very interesting to know about these ongoing discussions. Let's keep an eye on them, AKA I also clicked "Subscribe" here and there...

jluethi (Collaborator, Author) commented Jun 20, 2022

Had a conversation with Kevin Yamauchi today about file numbers and storage. He strongly recommended checking out on-premises object storage approaches, such as on-premises S3 buckets.
We should probably start this conversation with IT, @gusqgm. And it would be very interesting to know whether an OMERO server could be a way to manage the object storage setup.

@tcompa tcompa changed the title Optimal chunking: Tradeoff between interactivity & IT infrastructure? Optimal I/O and on-disk storage for zarr array (chunking scheme + Zarr store) Jun 28, 2022
@tcompa tcompa changed the title Optimal I/O and on-disk storage for zarr array (chunking scheme + Zarr store) Optimal I/O and on-disk storage for zarr arrays (chunking scheme + Zarr store) Jun 28, 2022
@jluethi jluethi self-assigned this Jul 12, 2022
jluethi (Collaborator, Author) commented Jul 28, 2022

Very interesting discussions & inputs in this image.sc thread: https://forum.image.sc/t/storing-large-ome-zarr-files-file-numbers-sharding-best-practices/70038/6

My main takeaways for Fractal:

  1. Sharding will be the way this eventually gets solved. Things are moving, but it's hard to judge when sharding will be available. On a positive note: the Python implementation (which is most relevant to us) will likely be the first one. That should allow us to get started with it, even if full adoption as part of OME-NGFF may still take a bit longer.

From Jonathan Striebel:

Yes, exactly. ZEP001 is the formal process of agreeing on the new spec for Zarr v3. Atm it seems like extensions such as sharding are not yet part of the initial spec (see my comment here 1), but I’ll formally propose adding sharding as a follow-up ZEP in that case soon.

I’m currently working on an implementation for sharding zarr-python, as v3 is available there as a preview.

The discussion around ZEP001 is very much active 1, so it’s definitely moving along. Personally, I’m assuming that ZEP001 gets approved after the current round of reviews, since many stakeholders have been involved in the discussion around the new spec.

  2. ZIP stores are actually still quite interesting. The main reason nobody seems to have touched them in the OME-NGFF community: They are not part of the official Zarr spec, just part of the Python implementation. As such, ZIP stores wouldn't work in other applications. But a very interesting statement from Josh Moore on this:

While zarr-python supports ZipStores, I think most other implementations don’t.

Additionally, the v2 Zarr spec is currently silent on Zips. My inclination is to work on a ZEP for them in v3, but someone with interest could look into making it happen for OME-NGFF + v2.

=> If file numbers become a large issue before sharding arrives, looking into using ZIP stores in the meantime seems like a valid idea. There may even be more support for this long term (though again, see point 1: I think that long term, sharding will likely be the real solution).

  3. The discussion of object storage vs. filesystems isn't as one-sided as I expected. I think we should be mostly fine staying on the existing file servers, and we can consider using object storage for sharing or once it becomes more easily available at our institutions.

@jluethi jluethi added the Backlog Backlog issues we may eventually fix, but aren't a priority label Aug 4, 2022
@tcompa tcompa transferred this issue from fractal-analytics-platform/fractal-client Sep 7, 2022
@jluethi jluethi moved this from Ready to On Hold in Fractal Project Management Oct 12, 2022
@tcompa tcompa added the unclear Let's check whether this is still an open issue label Feb 15, 2023
tcompa (Collaborator) commented Feb 15, 2023

If we want to explore sharding, let's create a more specific issue.

jluethi (Collaborator, Author) commented Feb 15, 2023

Let's continue sharding discussions here: #294

@jluethi jluethi closed this as completed Feb 15, 2023
@github-project-automation github-project-automation bot moved this from On Hold to Done in Fractal Project Management Feb 15, 2023
@tcompa tcompa removed the unclear Let's check whether this is still an open issue label Feb 20, 2023