
[v3] benchmarks and performance tools #2034

Open
d-v-b opened this issue Jul 13, 2024 · 1 comment

Comments

@d-v-b
Contributor

d-v-b commented Jul 13, 2024

It would be very useful to record and publish benchmarks of how zarr-python performs in various workloads. Especially with the addition of sharding, I think people working with Zarr will benefit from some guidance on how to avoid performance problems. And without benchmarks, we can't optimize the performance of zarr-python itself.

So we should write some benchmarking code, tracking things like duration and memory usage for a few core workloads, like:

  • writing chunks to an array
  • reading chunks from an array
  • creating arrays and groups
  • deleting chunks from an existing array

As a reach goal, the benchmark code itself should be useful to people who want to check zarr-python performance on different compute / storage backends.
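To make the idea concrete, here is a minimal sketch of what such a benchmark harness could look like, using only the standard library (`time.perf_counter` for duration, `tracemalloc` for peak memory). The `write_workload` function is a hypothetical placeholder; a real benchmark would call zarr-python's array read/write APIs instead.

```python
import time
import tracemalloc

def benchmark(fn, *args, repeats=5):
    """Run fn several times, returning best wall-clock time and peak memory."""
    best_time = float("inf")
    peak_mem = 0
    for _ in range(repeats):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best_time = min(best_time, elapsed)
        peak_mem = max(peak_mem, peak)
    return best_time, peak_mem

# Placeholder workload; in practice this would write chunks to a zarr array.
def write_workload():
    data = bytes(1024 * 1024)  # 1 MiB of zeros standing in for one chunk
    return len(data)

duration, peak = benchmark(write_workload)
print(f"best time: {duration:.6f}s, peak memory: {peak} bytes")
```

Taking the best of several repeats (rather than the mean) is a common way to reduce noise from background processes when the workload itself is deterministic.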

It looks like @JackKelly already started work in this direction at https://github.com/zarr-developers/zarr-benchmark. @JackKelly, does the direction I'm proposing align with your vision for that repo?

@JackKelly
Contributor

Sounds good!

> does the direction i'm proposing align with your vision for that repo?

Absolutely!

Although I'm afraid I'm swamped with work at the moment, so I'm not sure when I'll next be able to work on that zarr-benchmark code. You're more than welcome to take that repo and do whatever you want with it!

Although, you might be better served by writing some super-simple scripts to benchmark zarr-python. I perhaps fell foul of the classic "computer science syndrome" of trying to build something general-purpose which inevitably ends up more complex than a set of special-purpose scripts 🙂

That said, as I'm sure you know, IO benchmarking is surprisingly tricky.

When reading from local IO, at a minimum you have to clear the operating system's cache (which zarr-benchmark does).
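On Linux, dropping the page cache between benchmark runs can be done like this (a minimal sketch; it requires root, and the `/proc/sys/vm/drop_caches` path is Linux-specific):

```shell
# Flush dirty pages to disk first, then drop the page cache so the
# next read actually hits the storage device rather than memory.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
```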

Also, if you're really trying to be as fair and reproducible as possible then you need to consider if you're going to warm things up or not (both SSDs and cloud object storage perform a little differently for "cold" reads vs "warm" reads, even if "warm" reads aren't cached locally). zarr-benchmark doesn't handle this. TBH, if we're just interested in whether one zarr-python PR is 10x faster than another, then we can ignore these details. I think these details only become important when we're trying to measure performance differences of a few percent.
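The warm-vs-cold effect is easy to see even without touching the OS cache: the first read of a file usually behaves differently from repeated reads. A hypothetical illustration (note that without dropping the page cache first, even the "first" read may already be warm, so treat the numbers as a sketch):

```python
import os
import tempfile
import time

def timed_read(path):
    """Return the wall-clock time to read the whole file once."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - t0

# Write a 4 MiB test file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(4 * 1024 * 1024))
    path = f.name

first = timed_read(path)                         # first read after writing
warm = min(timed_read(path) for _ in range(5))   # best of several cached reads
print(f"first read: {first:.6f}s, warm read: {warm:.6f}s")
os.unlink(path)
```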

These plots from the excellent "AnyBlob" paper are informative:

[Image: throughput plots from the AnyBlob paper]
