Use of CellBender for scATACseq data? #167

Open
JeGrundman opened this issue Nov 16, 2022 · 13 comments

@JeGrundman

Hi,

Thank you for creating this software! The CellBender manuscript (and other places on this GitHub page) mentions that CellBender can be used for non-scRNAseq data, though I believe only CITE-seq was named as an alternative. Theoretically, I don't see why the program couldn't also be applied to scATACseq data, but I wanted to double-check here that it would be appropriate for these datasets. If it can be applied without issue, do you have any recommendations for parameter changes to consider when running it specifically on scATACseq data?

Thanks for your time!

@sjfleming
Member

Hi @JeGrundman , this is a great question, and it's something I need to do some more thinking and experimenting with myself. I personally do not have much experience with scATAC-seq data. But there are other users who have told me they've tried CellBender on ATAC data and had some mixed results. See #121 . Some people just use CellBender to clean up the scRNA-seq part of a dataset that is mixed RNA and ATAC.

Are you talking about multimodal data with both RNA and ATAC in the same cells? Or are you thinking of just scATAC-seq?

@JeGrundman
Author

Hi! Thanks for your quick response! I was mostly interested in it for scATACseq only. I have multiome scATAC and scRNA data, but we ran them separately through CellRanger instead of using CellRanger-ARC. (We did also run CellBender on our scRNA data and it looks great.)

I tested CellBender on our scATAC data and the program runs without issue, but it's more the interpretation/ensuring that this is not an inappropriate tool for ATAC data that I'm concerned with.

@sjfleming
Member

sjfleming commented Nov 16, 2022

It's a great question.

In principle, CellBender is a totally appropriate tool for scATAC data. The noise model should still be appropriate for ATAC data, as far as I can tell. I'm not sure if the mechanism of "ambient" / cell-free reads really holds for ATAC data in the same way... somehow it's easier for me to imagine RNA floating around in solution from damaged cells, and it's a bit harder to imagine genomic DNA floating around in a way that would incorporate into lots of droplets. But the PCR-chimera / "swapping" noise mechanism should still be relevant in scATAC data.

The only thing that makes scATAC data more challenging in principle than scRNA-seq is that scATAC data is even more sparse than scRNA-seq, which is already super sparse. There are so many features / "peaks" in scATAC data... but still, in principle, I think the auto-encoder-type structure that CellBender uses to learn a prior on biological counts should still work alright for scATAC data.

I guess one thing that you might be able to look at would be the following:

  1. Sum all the counts in empty droplets (your best guess at which droplets these are) so that you get a vector of counts per feature. Normalize the vector so it sums to 1.
  2. Sum all the counts in cell-containing droplets (your best guess at which droplets these are) so that you get another vector of counts per feature. Also normalize this so it sums to 1.
  3. Plot (1) versus (2) on linear and log-log axes, and see if it looks like the "empty droplet" profile is similar to the profile of cells. If it is, then I bet the noise mechanisms are close enough that the CellBender model should do well in principle.
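
The three steps above can be sketched roughly as follows. This is a minimal illustration, not part of CellBender: the function names (`feature_profile`, `compare_profiles`, `plot_profiles`) are made up for this example, `counts` is assumed to be a barcodes-by-features matrix of raw counts, and `is_cell` is assumed to be your own best-guess boolean labeling of which barcodes contain cells.

```python
import numpy as np


def feature_profile(counts, barcode_mask):
    """Sum counts over the selected barcodes and normalize so the vector sums to 1.

    counts: (n_barcodes, n_features) raw count matrix (dense array; a scipy
        sparse CSR matrix should also work with this indexing).
    barcode_mask: boolean array of length n_barcodes.
    """
    summed = np.asarray(counts[barcode_mask].sum(axis=0), dtype=float).ravel()
    total = summed.sum()
    if total == 0:
        raise ValueError("selected barcodes contain no counts")
    return summed / total


def compare_profiles(counts, is_cell):
    """Return (empty_profile, cell_profile), i.e. steps 1 and 2."""
    is_cell = np.asarray(is_cell, dtype=bool)
    return feature_profile(counts, ~is_cell), feature_profile(counts, is_cell)


def plot_profiles(empty, cell):
    """Step 3: scatter the two profiles on linear and log-log axes."""
    import matplotlib.pyplot as plt  # imported here so steps 1-2 need only numpy

    # Smallest positive value, used to draw the y = x line on log axes.
    lo = np.min(np.concatenate([empty[empty > 0], cell[cell > 0]]))
    hi = max(empty.max(), cell.max())
    fig, axes = plt.subplots(1, 2, figsize=(10, 4.5))
    for ax, scale in zip(axes, ("linear", "log")):
        ax.scatter(empty, cell, s=2, alpha=0.3)
        ax.plot([lo, hi], [lo, hi], color="gray", linewidth=1)  # y = x reference
        ax.set_xscale(scale)
        ax.set_yscale(scale)
        ax.set_xlabel("fraction of counts in empty droplets")
        ax.set_ylabel("fraction of counts in cells")
        ax.set_title(f"{scale} axes")
    return fig


# Toy data: 4 barcodes x 3 features; the first two barcodes are "empty".
toy_counts = np.array([[1, 0, 1],
                       [0, 2, 0],
                       [10, 5, 5],
                       [20, 10, 10]])
toy_is_cell = np.array([False, False, True, True])
empty_profile, cell_profile = compare_profiles(toy_counts, toy_is_cell)
print(empty_profile)  # [0.25 0.5  0.25]
print(cell_profile)   # [0.5  0.25 0.25]
```

Features that fall far from the y = x line in the plot are the ones whose relative abundance differs between empties and cells.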

@sjfleming
Member

Actually, now that I'm looking, I see 10x has a public scATAC dataset that could serve as a great benchmark. https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_hgmm_ATACv2_nextgem_Chromium_Controller/10k_hgmm_ATACv2_nextgem_Chromium_Controller_web_summary.html

It looks like the data suffers from a type of background noise very similar to scRNA-seq data, just looking at their cross-species plot.

I will go ahead and run this and see what happens and I'll let you know! :)

@sjfleming
Member

Indeed it looks like "ambient" noise in scATAC is pretty low:
image
(see how the empty drops only have like 20 counts)

while "swapping" noise is pretty high / on par with scRNA-seq data:
image
(see how the cross-species counts are actually pretty high, near 100, and there is a strong linear trend between total counts and cross-species counts)
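
For anyone wanting to reproduce this kind of check on a human-mouse mixture, here is a rough sketch of how the cross-species counts and their trend against total counts could be computed. It is not the code used for the plots above. The function names and the `is_human_feature` mask are assumptions for illustration, and it simply treats the minority species in each barcode as "cross-species", which will be misleading for true human-mouse doublets.

```python
import numpy as np


def cross_species_counts(counts, is_human_feature):
    """Per-barcode total counts and minority-species ("cross-species") counts.

    counts: (n_barcodes, n_features) dense array of raw counts.
    is_human_feature: boolean array of length n_features (True = human,
        False = mouse). How features are assigned to a species is up to you.
    """
    counts = np.asarray(counts)
    is_human_feature = np.asarray(is_human_feature, dtype=bool)
    human = counts[:, is_human_feature].sum(axis=1)
    mouse = counts[:, ~is_human_feature].sum(axis=1)
    total = human + mouse
    cross = np.minimum(human, mouse)  # counts from the "wrong" species
    return total, cross


def linear_trend(total, cross):
    """Least-squares line cross ~ slope * total + intercept."""
    slope, intercept = np.polyfit(total, cross, deg=1)
    return slope, intercept


# Toy data: feature 0 is human, feature 1 is mouse; every barcode has
# exactly 2% of its counts coming from the other species.
toy_counts = np.array([[98, 2],
                       [196, 4],
                       [6, 294],
                       [2, 98]])
toy_is_human = np.array([True, False])
total, cross = cross_species_counts(toy_counts, toy_is_human)
slope, intercept = linear_trend(total, cross)
print(total, cross)  # [100 200 300 100] [2 4 6 2]
```

A slope well above zero with a small intercept would be consistent with noise that scales with a droplet's total counts rather than a roughly constant ambient contribution.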

@JeGrundman
Author

Hi!

Thank you so much for your detailed response. My understanding of it is that, based on the public dataset, the majority of the noise in a scATACseq experiment that CellBender would correct for would come from "swapping" instead of ambient noise (though I do wonder if the ambient noise is more relevant in certain datasets like frozen tissue).

Regarding your suggestions for assessing similarity in cell/empty droplet profiles and on point 3, I ran this for one of our pooled samples and the result is a sort of bifurcating graph as shown here:

[Screenshot of the plot, 2022-11-16]

I used cellranger filtered barcodes as cells and the rest of the barcodes from the raw matrix as empty, since I really don't have a better way of determining which is which at this point. The corresponding gene expression data looks far more linear. I'm a bit unsure how to interpret whether or not this means CellBender should be applied here.

@sjfleming
Member

Okay, I see. This plot probably isn't definitive about whether CellBender can be applied, but it's interesting to see that there's at least some similarity between the two profiles. It looks like most features are pretty close to the y=x diagonal, but some features systematically have lower relative counts in cells than in the empties. CellBender should still be able to handle that fine.

Doing a few test runs, I'm seeing that I will need to make a couple tweaks to the codebase to make ATAC data run without error. I'll include those updates in v0.3.0

@sjfleming sjfleming self-assigned this Nov 17, 2022
@sjfleming sjfleming added the enhancement New feature or improvement label Nov 17, 2022
@sjfleming sjfleming added this to the v0.3.0 milestone Nov 17, 2022
@JeGrundman
Author

Thank you! I really appreciate all your help and insight on this. By the way, I did actually manage to get CellBender to run on ATAC data without error (on CPU, without GPU/CUDA). Using a GPU/CUDA throws errors, though, if that's what you're referring to (I had assumed this was a memory issue on my end).

@sjfleming
Member

Okay good to know CPU didn't give you an error. Yeah, the main problem is that this is never going to fit into GPU memory (any GPU... it's just too many features...), unless I incorporate some new kinds of cutoffs. I think the thing to do is to only analyze features which seem likely to be big contributors to the background noise, and assume the other features contribute a negligible amount of noise.

@sjfleming
Member

I do think this is a safe kind of assumption... it's just a matter of correctly prioritizing features likely to contribute to background noise.
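
To make the idea concrete, here is one hypothetical way to prioritize features: rank them by their total counts in presumed-empty droplets and keep the top few. This is only a sketch of the general idea, not what CellBender does or will do. The ranking criterion, the function name `top_background_features`, and the `n_keep` parameter are all assumptions for illustration.

```python
import numpy as np


def top_background_features(counts, is_cell, n_keep):
    """Pick the features with the most counts in presumed-empty droplets.

    counts: (n_barcodes, n_features) dense array of raw counts.
    is_cell: boolean array of length n_barcodes (best-guess cell calls).
    n_keep: number of features to retain for analysis.

    Returns the sorted indices of the kept features and the fraction of all
    empty-droplet counts that those features account for.
    """
    counts = np.asarray(counts)
    is_cell = np.asarray(is_cell, dtype=bool)
    empty_totals = counts[~is_cell].sum(axis=0)
    # Stable sort on negated totals gives descending order with
    # deterministic tie-breaking.
    order = np.argsort(-empty_totals, kind="stable")
    keep = np.sort(order[:n_keep])
    covered = empty_totals[keep].sum() / empty_totals.sum()
    return keep, covered


# Toy data: 4 barcodes x 4 features; the first two barcodes are "empty".
toy_counts = np.array([[1, 0, 2, 0],
                       [0, 3, 0, 0],
                       [10, 5, 5, 1],
                       [20, 10, 10, 2]])
toy_is_cell = np.array([False, False, True, True])
keep, covered = top_background_features(toy_counts, toy_is_cell, n_keep=2)
print(keep)     # [1 2]
print(covered)  # 0.8333333333333334
```

The `covered` fraction gives a rough sense of how much background would be ignored by dropping the remaining features, which speaks to whether the "negligible noise" assumption is reasonable for a given dataset.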

@JeGrundman
Author

Thank you!

Adding that feature wouldn't bias the ultimate result though in terms of either the corrected counts or the cells called, right?

For my own purposes, I will probably continue tweaking some parameters/testing my data using CPU, since it runs all right. I ended up getting back around 2x as many cells from CellBender as from CellRanger, so if in principle this program is all right to run with ATAC data, the increased cell yield is something we would be highly interested in...

@JeGrundman
Author

Hi! I just wanted to give an update in case it was of interest -- the ATAC cells outputted from CellBender ended up being of worse quality than the ones outputted only by CellRanger (comparing the cells from CellRanger to the ones uniquely called by CellBender) and did not cluster with the CellRanger-called cells. CellBender also did not appear to change the counts/remove noise in any measurable way.

The program is still great for scRNAseq and we will be using it for that data, but not for scATACseq. Thanks again for providing this software and for taking the time to talk this through with me!

@sjfleming
Member

Well @JeGrundman thanks for letting me know that. I will put some more effort into developing and benchmarking for ATAC data in future!
