Sharing compressed data #48

adamgayoso · 2022-02-27T16:55:14Z

Hello, this is such a cool project!

I was wondering if the compressed anndata objects could be shared on the website. For example, for the full dataset, saving with write_h5ad(path, compression="gzip") reduces the file size from 15 GB to ~5 GB. While it takes a bit longer to save with compression, reading is still pretty fast. I also noticed an issue with adata.obs["donor"], which has mixed string and float types, so casting it before saving with adata.obs["donor"] = adata.obs["donor"].astype(str) would also be appreciated.
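
To make the request concrete, here is a minimal sketch of the two steps (cast the donor column, then write with gzip). The helper names normalize_donor and write_compressed and the toy donor IDs are made up for illustration; only write_h5ad(path, compression="gzip") and the astype(str) cast come from the comment above.

```python
import pandas as pd


def normalize_donor(obs: pd.DataFrame, column: str = "donor") -> pd.DataFrame:
    """Return a copy of `obs` with `column` cast to plain strings."""
    out = obs.copy()
    out[column] = out[column].astype(str)
    return out


def write_compressed(adata, path):
    """Write an AnnData object with gzip compression (smaller file, slower save).

    `adata` is expected to be an anndata.AnnData; this helper is hypothetical.
    """
    adata.obs = normalize_donor(adata.obs)
    adata.write_h5ad(path, compression="gzip")


# Toy stand-in for adata.obs; the donor IDs are invented for illustration.
toy_obs = pd.DataFrame(
    {"donor": ["F21", 35.0, "F45"]},  # mixed str / float, as in the report
    index=["cell_0", "cell_1", "cell_2"],
)
fixed_obs = normalize_donor(toy_obs)
print(fixed_obs["donor"].tolist())
```

The cast turns every entry into a string, so a float donor ID such as 35.0 is stored as the text "35.0" rather than being silently mixed with string IDs.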

We are working on faster implementations of scvi-tools using jax. In this notebook we can process 150k cells in <5 minutes on Colab. I was hoping to create a new tutorial with your dataset to show that we can process 900k cells in < 1 hr (integration + visualization, all for free!).


emdann · 2022-02-27T17:18:37Z

Hi @adamgayoso, thanks a lot for these suggestions! I will regenerate the datasets for the website and notify here as soon as they are up.
The jax speedup sounds really cool, and we'd be super happy for our data to be featured in the tutorials of course! I'd be very keen to peek at the first results, and of course I'd be very happy to help with interpretation or writing up vignettes if needed.

adamgayoso · 2022-02-27T17:20:43Z

Thank you @emdann!! It would also be beneficial if the genes used to run the scvi model were included (in adata.var) so I can better reproduce the results. Also, what batch_key was used? I saw "bbk" I think in the notebooks on this repo.

emdann · 2022-02-27T20:27:44Z

The batch key used is a concatenation of method (10X protocol, 3' or 5') and donor (see under "Add batch key" here)
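
For anyone reproducing this, a rough sketch of building such a key on a toy table follows. The column name "bbk" is the one guessed earlier in the thread, while the "_" separator and the example method/donor values are assumptions; the notebook may concatenate differently.

```python
import pandas as pd

# Toy stand-in for adata.obs; method and donor values are illustrative only.
obs = pd.DataFrame(
    {
        "method": ["3prime", "5prime", "3prime"],
        "donor": ["F21", "F21", "F45"],
    },
    index=["cell_0", "cell_1", "cell_2"],
)

# Concatenate method and donor into one batch label per cell.
# Assumptions: the key is called "bbk" and uses "_" as separator.
obs["bbk"] = obs["method"].astype(str) + "_" + obs["donor"].astype(str)
obs["bbk"] = obs["bbk"].astype("category")

print(obs["bbk"].tolist())
```

Each distinct method/donor pair becomes its own batch, so the same donor profiled with both protocols contributes two batches.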

adamgayoso · 2022-02-28T17:00:50Z

The new version seems to give reasonably similar results to the old version, which is good. I also noticed that Scanpy UMAP plotting of the celltype is not working for some reason.

emdann · 2022-03-02T08:19:15Z

The .h5ad objects for download should now be updated. Could I ask you to try again to check if the plotting problem persists?

adamgayoso · 2022-03-02T20:53:57Z

Yes looks great! I still have an issue plotting

bdata.obs["celltype"] = np.array(list(bdata.obs.celltype_annotation))
sc.pl.embedding(bdata, basis="X_mde", color=['celltype'], frameon=False)

Plotting celltype_annotation alone gives an error

Could it be there are too many categories for scanpy?

emdann · 2022-03-03T09:49:13Z

Notes from troubleshooting attempts:

Part of the problem could be the NaNs (scverse/scanpy#2133): I found the maternal contaminants were not flagged correctly in this object, these are cells with adata.obs['celltype_annotation'] set to NaN. I will modify that in the file ASAP, but for now you can try filtering those out before plotting.
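
A minimal sketch of that filtering step, shown on a toy table standing in for adata.obs. The cell type labels are made up; applying the mask to the AnnData object is shown only as a comment and assumes standard boolean indexing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs; NaN marks unannotated (e.g. maternal contaminant) cells.
obs = pd.DataFrame(
    {"celltype_annotation": ["B1", np.nan, "NK", np.nan]},
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Boolean mask of cells that carry an annotation.
annotated = obs["celltype_annotation"].notna()
print(f"dropping {int((~annotated).sum())} unannotated cells")

obs_filtered = obs.loc[annotated]

# With the real object the same mask would be used as:
#   adata_plot = adata[annotated.to_numpy()].copy()
```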

After filtering out the NaNs I still get all gray, so it might indeed be a problem with scanpy trying to handle too many categories (and possibly a pandas update?). Setting groups also throws a pandas error.
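
One quick way to test the "too many categories" hypothesis is to count the categories and compare against the size of the largest default palette. The threshold of 102 below is an assumption based on my understanding of scanpy's built-in palettes and should be checked against the installed version; the helper names are hypothetical.

```python
import pandas as pd

# Assumption: scanpy's largest default palette has 102 colors and it falls
# back to uniform grey beyond that. Verify against your scanpy version.
ASSUMED_MAX_PALETTE = 102


def n_categories(labels: pd.Series) -> int:
    """Count categories, including unused ones for categorical columns."""
    if isinstance(labels.dtype, pd.CategoricalDtype):
        return len(labels.cat.categories)
    return int(labels.dropna().nunique())


def may_fall_back_to_grey(labels: pd.Series, max_palette: int = ASSUMED_MAX_PALETTE) -> bool:
    """True if there are more categories than the assumed palette size."""
    return n_categories(labels) > max_palette


# Toy annotation column with 150 made-up cell types.
toy = pd.Series([f"type_{i % 150}" for i in range(1000)], dtype="category")
print(n_categories(toy), may_fall_back_to_grey(toy))
```

If the check returns True, passing an explicit palette or plotting one lineage at a time are the obvious things to try.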

I usually plot annotations by lineage, using the assignment saved here. The best workaround I can suggest for now is trying something like:

import json

import numpy as np
import scanpy as sc

# Load the lineage -> cell types assignment saved in the repo
with open('Pan_fetal_immune/metadata/anno_groups.json', 'r') as json_file:
    anno_groups_dict = json.load(json_file)

# Keep labels only for cells of the chosen lineage; all other cells stay NaN
adata.obs['annotation_plot'] = np.nan
lineage = 'B CELLS'
lineage_cells = adata.obs['celltype_annotation'].isin(anno_groups_dict[lineage])
adata.obs.loc[lineage_cells, 'annotation_plot'] = adata.obs.loc[lineage_cells, 'celltype_annotation'].copy()

sc.pl.umap(adata, color=['annotation_plot'], title=lineage)

This tells me I should probably save the annotation groups in adata.obs...
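
A possible sketch of what that could look like: invert the lineage -> cell types dictionary and map it onto the annotation column. It assumes anno_groups.json is a dict of lineage name to list of cell type labels (as the isin call above suggests); apart from 'B CELLS', the lineage and cell type names are invented, and a cell type listed under several lineages keeps the first one encountered.

```python
import pandas as pd


def invert_groups(anno_groups: dict) -> dict:
    """Turn {lineage: [celltypes]} into {celltype: lineage}."""
    celltype_to_lineage = {}
    for lineage, celltypes in anno_groups.items():
        for celltype in celltypes:
            # First lineage wins if a cell type is listed more than once.
            celltype_to_lineage.setdefault(celltype, lineage)
    return celltype_to_lineage


# Toy stand-in for the contents of anno_groups.json (names are hypothetical).
toy_groups = {
    "B CELLS": ["PRE_B", "MATURE_B"],
    "NK/T CELLS": ["NK", "CD4+T"],
}

# Toy stand-in for adata.obs.
obs = pd.DataFrame(
    {"celltype_annotation": ["PRE_B", "NK", "UNLISTED_TYPE"]},
    index=["cell_0", "cell_1", "cell_2"],
)

# Cell types missing from every group end up as NaN in the new column.
obs["lineage"] = obs["celltype_annotation"].map(invert_groups(toy_groups))
print(obs)
```

With a lineage column stored in adata.obs, selecting one lineage for plotting becomes a simple comparison instead of a dictionary lookup.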
