
Cell probabilities' output interpretation #151

Closed
learning-MD opened this issue Aug 27, 2022 · 11 comments
@learning-MD

learning-MD commented Aug 27, 2022

Hi,

Thank you for this great tool! It has been very useful for me in single-nucleus RNA-seq analyses in the past. I'm currently trying it out on a hashed PBMC dataset (5 samples hashed together, aiming for ~5000 cells per individual sample). In general, without using CellBender, the QC of the sample looks good and I can visualize clear, distinct clusters with known PBMC markers. There's not much of an ambient plateau in the CellRanger 7.0 output:

[image: CellRanger 7.0 barcode rank plot]

With that in mind and expecting minimal background contamination, I ran CellBender with the following code:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 200 \
--z-layers 1000 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 50 \
--output cellbender_output.h5

When interrogating the output, it looks like CellBender is overcalling cells:

[images: CellBender output report plots]

Am I interpreting that correctly? That there truly is minimal background contamination in this hashed sample? It's a very different-looking output compared to what I'm used to seeing with solid tissue and single-nucleus RNA-seq. When I perform downstream analysis of this PBMC output in Seurat, I have ~23,000 singlets for the 5 hashed samples that cluster as follows:

[images: Seurat clustering plots]

So, I do not think there's any QC failure on the wet lab or library prep side of things. Any suggestions for improvement would be greatly appreciated. Thanks!

Edit: If it helps, this is a superloaded experiment. With downstream QC, I removed ~9000 doublets and the total number of singlets was ~23000. The goal was to target ~4000-6000 cells per hashtagged sample.

@sjfleming
Member

Hi @learning-MD , sorry for the late response.

Looks like a cool dataset! Very nice. You should be able to use cellbender if you want to (it just won't remove much), so we should be able to get it working.

Your interpretation is correct: cellbender did not really work on your dataset, since it thought everything was a cell.

My guess is that the problem was that cellbender had trouble finding the "ambient" empty droplet plateau. I'm not sure why the ambient plateau is so small in this dataset.

If I were you, I might try setting --low-count-threshold 5 instead of 50, just so it can see some more of those low count droplets. If that doesn't do it, maybe you could also try --empty-drop-training-fraction 0.5
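
To be concrete, that first change is just your original command with the low-count threshold lowered (everything else kept as you had it); if that alone doesn't do it, the fallback would be swapping --empty-drop-training-fraction 0.3 for 0.5 in the same command. Treat this as a sketch rather than tuned values:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 200 \
--z-layers 1000 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--output cellbender_output.h5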

Let me know if that doesn't work!

@learning-MD
Author

@sjfleming - Stephen, thanks for the recommendations.

When I ran with --low-count-threshold 5, I got the following error which was unexpected:

cellbender:remove-background: [epoch 056]  average training loss: 413063.7640
cellbender:remove-background: Inference procedure terminated early due to a NaN value in: mu, alpha, lam

The suggested fix is to reduce the learning rate.
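
(For reference, I take that suggestion to mean adding a smaller --learning-rate to the same command, something like the abridged example below with my other flags unchanged; the exact value is just my guess at something below the default 1e-4, and I have not tried this:)

cellbender remove-background \
--input raw_feature_bc_matrix.h5 \
--output cellbender_output.h5 \
--cuda \
--expected-cells 30000 \
--total-droplets-included 60000 \
--low-count-threshold 5 \
--learning-rate 5e-5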

Rather than reducing the learning rate, I tried what you suggested with --empty-drop-training-fraction 0.5 and ended up with the following QC outputs:

[images: CellBender QC output plots]

It looks more consistent with what I'd expect, but I'm not sure how to interpret the error from the first recommendation you had. Additionally, the ELBO plot looks funkier than what I've typically seen. The input was the raw .h5 file that CellRanger spit out. It does include both gene expression and multiplexing data, so I'm not sure if that had any influence. Any insight/suggestions you have would be appreciated.

Thanks!

@sjfleming
Member

Yikes, that first error is not something I like to see. Looks like it ran into some kind of numerical instability. Not sure why that happened...

The ELBO does look a bit strange on the training set. When you say gene expression and multiplexing data, what is the multiplexing data part? Are we talking about antibody capture or ATAC data, or what are those extra features? How many features are there in addition to gene expression?

@learning-MD
Author

@sjfleming These were just hashtag oligos (TotalSeq C), so not CITE-seq. We did 5' scRNA-seq + HTOs to multiplex 5 samples into one lane and targeting ~4000-6000 cells per individual sample. So, total of 5 different antibody tags.

Not sure if that helps or not.

@sjfleming
Member

Okay I see, so the count matrix is mainly just gene expression features (plus the 5 TotalSeq_C features). That seems very reasonable. Maybe I could suggest two other things to try:

  1. It might be helpful to reduce --z-dim and --z-layers. Even though I'm sometimes tempted to go to larger latent variable sizes like you have here, that can sometimes lead to this kind of instability. It turns out that (due to the similarity of cells of the same cell type) usually you really do not need a very complex neural network to capture the essentials. (scVI uses a z-dim of 10 by default, for example)
  2. You probably don't need to do this, but you could exclude the TotalSeq_C features from the cellbender analysis. Since these features are just being used to label sample of origin, they do not contain biological information about cell type, and it's possible (though unlikely) that this could make it harder for cellbender to find a good latent representation of cell type. I think the flag --exclude-antibody-capture should allow you to try this.
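
Putting those two suggestions together, here's a sketch of the run I'd try: your command with the default-sized latent space, and optionally the hashtag features excluded. The exact values are illustrative, not a prescription:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 100 \
--z-layers 500 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--exclude-antibody-capture \
--output cellbender_output.h5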

@learning-MD
Author

@sjfleming - apologies for the delay. I dropped --z-dim and --z-layers to their defaults (100 and 500, respectively) while keeping the low-count threshold at 5 (this is where I previously had errors). Below is what I did:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 100 \
--z-layers 500 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--output cellbender_output.h5

Below is the QC of the output. This looks a lot more like what I expect the results to look like, and I'm relatively happy with it. It looks like decreasing the z-dim and z-layers provided more stability. Curious to get your thoughts as well. Thanks!

[images: CellBender QC output plots]

@learning-MD
Author

learning-MD commented Oct 9, 2022

Well, I'm not sure again. Using --z-dim 100, --z-layers 500, and --low-count-threshold 5, two other hashed samples produced outputs similar to those in my original question:

[images: CellBender QC output plots]

I'm not sure what to make of this. Any other suggestions would be appreciated. Thanks.

@sjfleming
Member

Hi @learning-MD , well it looked like the first plot you sent was pretty encouraging. I'd say that run looks just fine.

But yeah, that last plot looks like cellbender was unable to locate the empty droplets appropriately. The learning curve actually looks fine to me, I think that's okay. But it'd be nice if it got the empty droplets right.

The only thing I see in that last run that you might be able to try is to use more --total-droplets-included. Can you maybe go up to 70000?

I need to make a few changes to cellbender so that tweaks like this become unnecessary, but currently, I sometimes see the effect you're seeing (all droplets are being called cells) get corrected if more droplets are included in the analysis.
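
Concretely, that would just mean rerunning your last command with this one value increased, e.g. (abridged here; keep your other flags as they were):

cellbender remove-background \
--input raw_feature_bc_matrix.h5 \
--output cellbender_output.h5 \
--cuda \
--expected-cells 30000 \
--total-droplets-included 70000 \
--low-count-threshold 5 \
--fpr 0.01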

@learning-MD
Author

@sjfleming Thanks! I extended the total droplets to 80000 instead:

[images: CellBender QC output plots]

It seems like extending the total droplets may be the way to proceed for the hashed samples where CellBender calls everything a cell. The learning curve looks okay, but the test curve seems to dip slightly. Not sure what to make of it, but I think this is a successful run? Thanks.

@sjfleming
Member

Okay excellent! Yes that definitely counts as a successful run. In my experience sometimes the test ELBO can meander around a little bit like that, and it's nothing to worry about.

Great!

@sjfleming
Member

Some aspects of training have been tweaked in v0.3.0 in such a way as to make the learning curve more reliable and hopefully mitigate these kinds of issues in the future.

Closed by #238
