[Enhancement] Performance Tweaks #125
Thanks for exploring this and offering some ideas. I am generally hesitant to add any sort of Worker-specific features to zarr. I've used threads.js in a couple of projects and tried to package workers previously (https://github.com/geotiffjs/geotiff.js/, https://github.com/hms-dbmi/viv, https://github.com/gosling-lang/gosling.js), and I just don't think the ecosystem is mature enough to package worker code consistently; it has always resulted in serious headaches. The primary issue in my experience is that packaging for npm requires additional bundling of the library, which adds tons of maintenance complexity and requires making trade-offs across platforms. For example, using ES6 module imports in a Worker is only supported in Chrome, so any worker entrypoint must bundle its dependencies completely or fall back to `importScripts`.

With that said, one thing I have been curious about (but have yet to explore) is providing a codec implementation in an application using `addCodec`:

```js
import { addCodec } from 'zarr';

class BloscWorkerPool { /* ... */ }

addCodec('blosc', BloscWorkerPool); // override builtin module
```

I'd be curious to see how you modified the code in `getBasicSelection`.
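To make that snippet concrete, here is a minimal sketch of what a worker-backed codec could look like, assuming a numcodecs-style interface (a static `fromConfig` plus an async `decode`); the worker file `./blosc.worker.js` and the message protocol are made up for illustration, not anything zarr ships:

```js
// Sketch only: assumes a numcodecs-style codec interface and a
// hypothetical pre-bundled worker entrypoint at './blosc.worker.js'.
class BloscWorkerPool {
  static codecId = 'blosc';

  static fromConfig(config) {
    return new BloscWorkerPool(config);
  }

  constructor(config) {
    this.config = config;
    this.worker = new Worker('./blosc.worker.js'); // hypothetical entrypoint
    this.pending = new Map();
    this.nextId = 0;
    this.worker.onmessage = ({ data }) => {
      this.pending.get(data.id)(data.bytes);
      this.pending.delete(data.id);
    };
  }

  decode(bytes) {
    const id = this.nextId++;
    return new Promise((resolve) => {
      this.pending.set(id, resolve);
      // transfer the underlying buffer to the worker instead of copying it
      this.worker.postMessage({ id, bytes, config: this.config }, [bytes.buffer]);
    });
  }
}
```

With something along these lines, the `addCodec('blosc', BloscWorkerPool)` call above would route all blosc decoding through a worker without touching zarr internals.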
Thanks for the quick answer! Completely understood, bundling is a nightmare. I like the codec approach, but I am wondering (not being very well versed in JavaScript concurrency) what the tradeoff would be between having `fetch` run on the main thread and then passing the buffer to the codec pool (as a transferable object), versus having both run in the worker. I guess `fetch` does some magic under the hood so that it would not make a difference? Happy to share my implementation; it is however a React hook at this point and needs some refactoring to be condensed. :D If possible I would love to discuss Viv as well, if you have the time? (Trying to implement it as a viewer for a data management and workflow platform.) Maybe we could schedule a little online discussion?
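For reference, passing an `ArrayBuffer` through `postMessage` with a transfer list moves ownership to the worker rather than copying, so handing a fetched buffer to a worker pool is cheap. A small sketch (the worker file name is made up):

```js
// main thread: fetch on main, decode in a worker
const res = await fetch(chunkUrl);                // chunkUrl: wherever the chunk lives
const buffer = await res.arrayBuffer();
const worker = new Worker('./decode.worker.js');  // hypothetical worker file

// the second argument transfers the buffer (zero-copy); after this call
// `buffer` is detached and no longer usable on the main thread
worker.postMessage({ buffer }, [buffer]);
```

Since the handoff itself is zero-copy, the difference between fetch-on-main and fetch-in-worker mostly comes down to which thread pays for handling the response and scheduling, not the transfer.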
Oh wow, this question actually made me realize you could implement a Store that performs fetching and decoding entirely in a worker. Basically, the custom store could act as a transformation layer on top of the original store, intercepting requests for the `.zarray` metadata (to strip out the compressor) and for chunks (to fetch and decode them itself):

```js
import { openArray, HTTPStore } from 'zarr';

let url = "https://example.com/data.zarr";

// fetch the metadata
let meta = await fetch(url + '/.zarray').then(res => res.json());
console.log(meta.compressor);
// {
//   "blocksize": 0,
//   "clevel": 5,
//   "cname": "lz4",
//   "id": "blosc",
//   "shuffle": 1
// }

// fetching and decoding happen on the main thread within `ZarrArray`
let store = new HTTPStore(url);
let arr = await openArray({ store });
console.log(arr.compressor); // Blosc

// fetching and decoding happen in a Worker (inside the `getItem` method of the store)
let workerStore = new HTTPWorkerStore(url);
let workerArr = await openArray({ store: workerStore });
console.log(workerArr.compressor); // null, store modifies the `.zarray` and decodes chunks itself
```
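A rough sketch of what that `HTTPWorkerStore` could look like; the assumed interface here (an async `getItem(key)` returning bytes) mirrors zarr's async store shape, while the worker protocol and file name are invented for illustration:

```js
// Sketch: a store that proxies chunk fetching and decoding to a worker.
class HTTPWorkerStore {
  constructor(url) {
    this.url = url;
    this.worker = new Worker('./store.worker.js'); // hypothetical worker bundle
    this.pending = new Map();
    this.nextId = 0;
    this.worker.onmessage = ({ data }) => {
      this.pending.get(data.id)(data.value);
      this.pending.delete(data.id);
    };
  }

  async getItem(key) {
    if (key.endsWith('.zarray')) {
      // intercept metadata: fetch it here and strip the compressor so
      // ZarrArray doesn't try to decode chunks a second time
      const res = await fetch(`${this.url}/${key}`);
      const meta = await res.json();
      this.compressorConfig = meta.compressor; // remember it for the worker
      meta.compressor = null;
      return new TextEncoder().encode(JSON.stringify(meta));
    }
    // chunk keys: fetch + decode inside the worker, get raw bytes back
    const id = this.nextId++;
    return new Promise((resolve) => {
      this.pending.set(id, resolve);
      this.worker.postMessage({
        id,
        url: `${this.url}/${key}`,
        compressor: this.compressorConfig,
      });
    });
  }
}
```

A real implementation would also need the rest of the store methods (e.g. `containsItem`) and error handling for missing chunks, but the interception idea is all in `getItem`.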
That would be great! Send me an email and we can find a time to chat ([email protected]).
Love that! To maybe reuse some logic, it would be nice to have some `ComposedStore` that would allow inserting parts of that "middleware" logic, maybe following a pattern like Apollo Link. Something like this could also help with other commonly compute-intensive tasks like rescaling, e.g.:

```js
let store = composeStore(
  decodeLink,
  rescaleLink,
  fetchTerminatingLink
);
```
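Nothing like this exists in zarr today; a toy sketch of the idea, where each link wraps the next `getItem` in the chain and all names are hypothetical:

```js
// A "link" takes the next getItem in the chain and returns a new one,
// mirroring Apollo Link's middleware composition. The last link terminates
// the chain (e.g. performs the actual fetch); responses then flow back up
// through the wrappers from innermost to outermost.
function composeStore(...links) {
  const terminating = links[links.length - 1];
  let getItem = terminating();
  for (let i = links.length - 2; i >= 0; i--) {
    getItem = links[i](getItem);
  }
  return { getItem };
}

// placeholder transforms, identity only; real versions would run the
// codec and rescale intensities
const decode = (buf) => buf;
const rescale = (buf) => buf;

// example links
const fetchTerminatingLink = () => async (key) =>
  (await fetch(key)).arrayBuffer();

const decodeLink = (next) => async (key) => decode(await next(key));

const rescaleLink = (next) => async (key) => rescale(await next(key));
```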
Just found this thread, and the code I wrote might be of interest as showing one way to deal with worker loading.
While exploring zarr.js performance on big datasets, I realized that there are some limitations when loading and decoding lots of chunks, as for now everything happens in the main thread (#33 mentioned this issue before). I played around a bit and found that using threads.js and their pool implementation speeds up the loading dramatically, as decoding can happen in a web worker (threads.js also provides the same abstraction for running on Node); however, this required a rewrite of `getBasicSelection` in `ZarrArray`. I understood from #33 that this might be mission creep, but maybe it would be an option to give the ability to offload the decoding to the store (maybe by extending `DecodingStore`); then the store implementation could handle the retrieval and decoding in a worker pool. (There are a few gotchas around what you are able to pass to and from a web worker, but one way is to only send the codec config and the chunk key to the worker; see the sketch below.)