Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharing compressed data #48

Open
adamgayoso opened this issue Feb 27, 2022 · 7 comments
Open

Sharing compressed data #48

adamgayoso opened this issue Feb 27, 2022 · 7 comments

Comments

@adamgayoso
Copy link

Hello, this is such a cool project!

I was wondering if the compressed anndata objects could be shared on the website. For example, for the full dataset, saving like write_h5ad(path, compression="gzip") reduces the file size to ~5gb from 15gb. While it takes a bit longer to save with compression, reading is still pretty fast. I also noticed an issue with adata.obs["donor"] where it's mixed string and float types, so also saving it with adata.obs["donor"] = adata.obs["donor"].astype(str) would be appreciated.

We are working on faster implementations of scvi-tools using jax. In this notebook we can process 150k cells in <5 minutes on Colab. I was hoping to create a new tutorial with your dataset to show that we can process 900k cells in < 1 hr (integration + visualization, all for free!).

@emdann
Copy link
Member

emdann commented Feb 27, 2022

Hi @adamgayoso, thanks a lot for these suggestions! I will regenerate the datasets for the website and notify here as soon as they are up.
The jax speed up sounds really cool, and we'd be super happy for our data to be featured in the tutorials of course! I'd be very keen to peak at the first results, and of course I'd be very happy to help with interpretation or writing up vignettes if needed.

@adamgayoso
Copy link
Author

adamgayoso commented Feb 27, 2022

Thank you @emdann!! It would also be beneficial if the genes used to run the scvi model were included (in adata.var) so I can better reproduce the results. Also, what batch_key was used? I saw "bbk" I think in the notebooks on this repo.

@emdann
Copy link
Member

emdann commented Feb 27, 2022

The batch key used is a concatenation of method (10X protocol, 3' or 5') and donor (see under "Add batch key " here)

@adamgayoso
Copy link
Author

The new version seems to give reasonably similar results to the old version which is good. I also noticed that Scanpy umap plotting of the celltype is not working for some reason.

@emdann
Copy link
Member

emdann commented Mar 2, 2022

The .h5ad objects for download should now be updated. Could I ask you to try again to check if the plotting problem persists?

@adamgayoso
Copy link
Author

Yes looks great! I still have an issue plotting

bdata.obs["celltype"] = np.array(list(bdata.obs.celltype_annotation))
sc.pl.embedding(bdata, basis="X_mde", color=['celltype'], frameon=False)

Plotting celltype_annotation alone gives an error

Could it be there are too many categories for scanpy?

image

@emdann
Copy link
Member

emdann commented Mar 3, 2022

Notes from troubleshooting attempts:

Part of the problem could be the NaNs (scverse/scanpy#2133): I found the maternal contaminants were not flagged correctly in this object, these are cells with adata.obs['celltype_annotation'] set to NaN. I will modify that in the file ASAP, but for now you can try filtering those out before plotting.

After filtering out nans I still get all gray, so it might indeed be a problem with scanpy trying to handle too many categories (and pandas update possibly?). Also setting groups throws a pandas error.

I usually plot annotations by lineage, using the assignment saved here. The best workaround I can suggest for now is trying something like:

import json
with open('Pan_fetal_immune/metadata/anno_groups.json', 'r') as json_file:
    anno_groups_dict = json.load(json_file)

adata.obs['annotation_plot'] = np.nan
lineage = 'B CELLS'
lineage_cells = adata.obs['celltype_annotation'].isin(anno_groups_dict[lineage])
adata.obs.loc[lineage_cells, 'annotation_plot'] = adata.obs.loc[lineage_cells, 'celltype_annotation'].copy()

sc.pl.umap(adata, color=['annotation_plot'], title=lineage)

Screenshot 2022-03-03 at 09 43 01

This tells me I should probably save the annotation groups in adata.obs...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants