Running Kana on large datasets (millions of cells) #84

slowkow · 2022-01-12T16:19:52Z


Some datasets are too big to fit in the RAM of the user's machine.

Even if a user's machine has more than enough RAM, my current understanding is that the Chrome web browser limits users to 4GB per tab. (More details here).

So, if Kana only accepts files on the user's local disk, then the user cannot run Kana on their files with millions of cells.

I'm interested to learn more about how to work around such limits and discuss two ideas below.

Cloudflare

Robert Aboukhalil has a great blog post that helped me to understand how Cloudflare might be useful for bioinformatics web apps.

Cloudflare can fetch a large file with millions of cells from a cloud provider like S3. Next, Cloudflare can run the WASM Kana code remotely and store the results remotely. Finally, the Kana user interface can fetch the subset of results that the user wants to see right now.

This means the user will never need to download the full dataset or the full results. Instead, the user will download a few megabytes of data on-the-fly to get a few million UMAP coordinates along with a few genes' expression values. So, it is certainly feasible to run Kana on millions of cells — but the analysis would be run remotely.
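As a rough check on the download size, here is a minimal sketch of the arithmetic. The assumptions are mine, not measured from Kana: 2-D UMAP coordinates, 4-byte values, no compression, and a hypothetical `payload_megabytes` helper.

```python
def payload_megabytes(n_cells, n_genes_shown=3, coord_dims=2, bytes_per_value=4):
    """Uncompressed size (in MB) of the coordinates plus a few genes' values.

    Assumes one value per cell for each coordinate dimension and each gene,
    stored at bytes_per_value bytes each. These are illustrative defaults.
    """
    values_per_cell = coord_dims + n_genes_shown
    return n_cells * values_per_cell * bytes_per_value / 1e6


if __name__ == "__main__":
    # One million cells, 2-D coordinates, three genes of interest.
    print(payload_megabytes(1_000_000), "MB")
    # Two million cells, coordinates only.
    print(payload_megabytes(2_000_000, n_genes_shown=0), "MB")
```

Under these assumptions the payload is closer to tens of megabytes than a few, although lower precision or compression would shrink it. Either way it is far smaller than the full dataset.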

However, the current .kana file format for analysis results does not support random access, so it does not support fetching subsets of results.

In contrast, the zarr file format does support random access and is easily extensible for any data. It is already in use by some groups who analyze single-cell data.
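To make the random-access point concrete, here is a toy sketch of chunked storage using only the Python standard library. It is not the zarr API; `ChunkedStore` is a hypothetical class that only illustrates the idea of reading the chunks that overlap a request instead of the whole array.

```python
from array import array


class ChunkedStore:
    """Toy 1-D float32 store split into separately addressable chunks."""

    def __init__(self, values, chunk_size):
        self.chunk_size = chunk_size
        self.length = len(values)
        # Each chunk is kept as its own bytes blob. In a remote setting,
        # each blob could be a separate object that is fetched on demand.
        self.chunks = {
            i // chunk_size: array("f", values[i:i + chunk_size]).tobytes()
            for i in range(0, len(values), chunk_size)
        }
        self.chunks_read = 0  # number of blobs decoded so far

    def read(self, start, stop):
        """Return values[start:stop], decoding only the overlapping chunks."""
        start = max(start, 0)
        stop = min(stop, self.length)
        out = []
        if start >= stop:
            return out
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        for idx in range(first, last + 1):
            block = array("f")
            block.frombytes(self.chunks[idx])
            self.chunks_read += 1
            offset = idx * self.chunk_size
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(block))
            out.extend(block[lo:hi])
        return out


if __name__ == "__main__":
    store = ChunkedStore([float(i) for i in range(1000)], chunk_size=100)
    print(store.read(250, 260))
    print("chunks decoded:", store.chunks_read, "of", len(store.chunks))
```

A request for ten values touches one chunk out of ten here. A format without random access would require reading everything up to the requested offset, or the whole file.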

Desktop app

A desktop app built on top of the underlying C++ code will not have an artificial 4GB limit on memory.

I do not know whether a desktop app built with Node, Deno, or Tauri is affected by the 4GB per-tab limit that exists inside Google Chrome.

One obvious benefit of a desktop app is that the user keeps their data on their machine without uploading it to Cloudflare.

LTLA · 2022-01-12T16:45:32Z

Maintainer

Even if a user's machine has more than enough RAM, my current understanding is that the Chrome web browser limits users to 4GB per tab. (More details here).

Not just that. WebAssembly uses 32-bit pointers, which imposes a 4 GB limit on the structures that can be created in the C++ code. This restriction should be lifted whenever the wasm64 proposal is taken up by Wasm runtimes, but until then, that's what we've got.
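For a sense of scale, here is a back-of-the-envelope sketch of what a 32-bit address space allows. The numbers are illustrative assumptions, not kana's actual memory layout: a dense matrix, 4-byte values, and 2,000 genes per cell. Real single-cell data is sparse and can be stored far more compactly, so treat this only as an upper-bound style illustration.

```python
# Bytes addressable with a 32-bit pointer.
WASM32_ADDRESS_SPACE = 2 ** 32


def dense_matrix_bytes(n_cells, n_genes, bytes_per_value=4):
    """Size in bytes of a dense n_genes x n_cells matrix (assumed layout)."""
    return n_cells * n_genes * bytes_per_value


def max_cells(n_genes, bytes_per_value=4):
    """Largest cell count whose dense matrix fits in the 32-bit address space.

    Ignores all other allocations, so the practical number would be lower.
    """
    return WASM32_ADDRESS_SPACE // (n_genes * bytes_per_value)


if __name__ == "__main__":
    print(WASM32_ADDRESS_SPACE / 2 ** 30, "GiB addressable")
    print(max_cells(2_000), "cells fit at 2,000 genes (dense, 4-byte values)")
    print(dense_matrix_bytes(1_000_000, 2_000) / 2 ** 30, "GiB for 1M cells")
```

Under these assumptions, a dense matrix for one million cells would need roughly twice the addressable space, regardless of how much RAM the machine has.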

Cloudflare can fetch a large file with millions of cells from a cloud provider like S3. Next, Cloudflare can run the WASM Kana code remotely and store the results remotely. Finally, the Kana user interface can fetch the subset of results that the user wants to see right now.

This has been considered, albeit not with Cloudflare. The Workers have runtime limits that will probably be exceeded if your dataset is truly large. They also have a 1 MB upload limit for the total size of the app, and our Wasm file alone is already getting close to that, never mind all the other JS bits and pieces around it. If we were to do this, we would likely need a full-fledged EC2 instance.

This means the user will never need to download the full dataset or the full results. Instead, the user will download a few megabytes of data on-the-fly to get a few million UMAP coordinates along with a few genes' expression values. So, it is certainly feasible to run Kana on millions of cells — but the analysis would be run remotely.

We thought about this but have yet to decide on an approach. The problem is whether one is really running an analysis with kana, given that the data and compute are hosted remotely (with the associated issues of data privacy and backend deployment that kana was designed to avoid). If we were to implement this, it would probably be a different application that takes a subset of kana's features and provides some kind of backend specification for state and feature queries.

A desktop app built on top of the underlying C++ code will not have an artificial 4GB limit on memory.

Provided it's not using Wasm, that is possible. For example, scran.chan uses the same C++ code to provide the same analysis functions as kana but in an R context. One could imagine writing the same bindings in your language of choice, e.g., Python, Julia, Golang...

Of course, whether I want to tangle with libraries like Qt to create the UI is another matter altogether.

2 replies

slowkow Jan 12, 2022
Author

Thanks for the reply, Aaron! I didn't know about some of the limits (wasm32 and Cloudflare 1MB).

scran.chan is cool! Thanks for sharing the link. I'm amazed at the number of algorithms you've translated and packaged into C++.

I agree with you that hosting data and compute remotely means that we're talking about a completely different application, probably outside the scope of Kana. I guess platforms for running Jupyter notebooks remotely (e.g. Google Colab, RStudio Server) are already filling this niche.

On that note, I would love to see folks from 10x Genomics integrate Kana into the web_summary.html output from Cell Ranger. That seems like a perfect fit.

jkanche Jan 28, 2022
Maintainer

A workaround for the 1MB limit of Cloudflare Workers is to split up the Wasm binary into multiple parts: one each for t-SNE, UMAP, marker detection, and the rest of the C++ code.

We can also lazily stream the Wasm and instantiate it so it won't count towards the 1MB limit.

Even though this would bring down the size, given the Cloudflare limits it may still not be feasible for compute-heavy applications.
