
Cell probabilities' output interpretation #151

Closed
learning-MD opened this issue Aug 27, 2022 · 11 comments
@learning-MD

learning-MD commented Aug 27, 2022

Hi,

Thank you for this great tool! It has been very useful for me in single-nucleus RNA-seq analyses in the past. I'm currently trying it out on a hashed PBMC dataset (5 samples hashed together, aiming for ~5000 cells per individual sample). In general, without using CellBender, the QC of the sample looks good and I can visualize clear, distinct clusters with known PBMC markers. There's not much of an ambient plateau in the CellRanger 7.0 output:

[image: CellRanger 7.0 barcode rank plot]

With that in mind and expecting minimal background contamination, I ran CellBender with the following code:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 200 \
--z-layers 1000 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 50 \
--output cellbender_output.h5

When interrogating the output, it looks like CellBender is overcalling cells:

[images: CellBender output report plots]

Am I interpreting that correctly? That there truly is minimal background contamination in this hashed sample? It's a very different-looking output compared to what I'm used to seeing with solid tissue and single-nucleus RNA-seq. When I perform downstream analysis of this PBMC output in Seurat, I have ~23,000 singlets for the 5 hashed samples that cluster as follows:

[images: Seurat clustering plots]

So, I do not think there's any QC failure on the wet lab or library prep side of things. Any suggestions for improvement would be greatly appreciated. Thanks!

Edit: If it helps, this is a superloaded experiment. With downstream QC, I removed ~9000 doublets and the total number of singlets was ~23000. The goal was to target ~4000-6000 cells per hashtagged sample.

@sjfleming
Member

Hi @learning-MD , sorry for the late response.

Looks like a cool dataset! Very nice. You should be able to use cellbender if you want to (it just won't remove much), so we should be able to get it working.

Your interpretation is correct: cellbender did not really work on your dataset, since it thought everything was a cell.

My guess is that the problem was that cellbender had trouble finding the "ambient" empty droplet plateau. I'm not sure why the ambient plateau is so small in this dataset.

If I were you, I might try setting --low-count-threshold 5 instead of 50, just so it can see some more of those low count droplets. If that doesn't do it, maybe you could also try --empty-drop-training-fraction 0.5
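
To be concrete, that first change is just your original command with the low-count threshold lowered (everything else kept as you had it); if that alone doesn't do it, the fallback would be swapping --empty-drop-training-fraction 0.3 for 0.5 in the same command. Treat this as a sketch rather than tuned values:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 200 \
--z-layers 1000 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--output cellbender_output.h5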

Let me know if that doesn't work!

@learning-MD
Author

@sjfleming - Stephen, thanks for the recommendations.

When I ran with --low-count-threshold 5, I got the following error which was unexpected:

cellbender:remove-background: [epoch 056]  average training loss: 413063.7640
cellbender:remove-background: Inference procedure terminated early due to a NaN value in: mu, alpha, lam

The suggested fix is to reduce the learning rate.
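
(For reference, I take that suggestion to mean adding a smaller --learning-rate to the same command, something like the abridged example below with my other flags unchanged; the exact value is just my guess at something below the default 1e-4, and I have not tried this:)

cellbender remove-background \
--input raw_feature_bc_matrix.h5 \
--output cellbender_output.h5 \
--cuda \
--expected-cells 30000 \
--total-droplets-included 60000 \
--low-count-threshold 5 \
--learning-rate 5e-5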

Rather than reducing the learning rate, I tried what you suggested with --empty-drop-training-fraction 0.5 and ended up with the following QC outputs:

[images: CellBender QC output plots]

It looks more consistent with what I'd expect, but I'm not sure how to interpret the error from the first recommendation you had. Additionally, the ELBO plot looks funkier than what I've typically seen. The input was the raw .h5 file that CellRanger spit out. It does include both gene expression and multiplexing data, so I'm not sure if that had any influence. Any insight/suggestions you have would be appreciated.

Thanks!

@sjfleming
Member

Yikes, that first error is not something I like to see. Looks like it ran into some kind of numerical instability. Not sure why that happened...

The ELBO does look a bit strange on the training set. When you say gene expression and multiplexing data, what is the multiplexing data part? Are we talking about antibody capture or ATAC data, or what are those extra features? How many features are there in addition to gene expression?

@learning-MD
Author

@sjfleming These were just hashtag oligos (TotalSeq C), so not CITE-seq. We did 5' scRNA-seq + HTOs to multiplex 5 samples into one lane and targeting ~4000-6000 cells per individual sample. So, total of 5 different antibody tags.

Not sure if that helps or not.

@sjfleming
Member

Okay I see, so the count matrix is mainly just gene expression features (plus the 5 TotalSeq_C features). That seems very reasonable. Maybe I could suggest two other things to try:

  1. It might be helpful to reduce --z-dim and --z-layers. Even though I'm sometimes tempted to go to larger latent variable sizes like you have here, that can sometimes lead to this kind of instability. It turns out that (due to the similarity of cells of the same cell type) usually you really do not need a very complex neural network to capture the essentials. (scVI uses a z-dim of 10 by default, for example)
  2. You probably don't need to do this, but you could exclude the TotalSeq_C features from the cellbender analysis. Since these features are just being used to label sample of origin, they do not contain biological information about cell type, and it's possible (though unlikely) that this could make it harder for cellbender to find a good latent representation of cell type. I think the flag --exclude-antibody-capture should allow you to try this.
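
Putting those two suggestions together, here's a sketch of the run I'd try: your command with the default-sized latent space, and optionally the hashtag features excluded. The exact values are illustrative, not a prescription:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 100 \
--z-layers 500 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--exclude-antibody-capture \
--output cellbender_output.h5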

@learning-MD
Author

@sjfleming - apologies for the delay. I dropped --z-dim and --z-layers to their defaults (100 and 500, respectively) while keeping the low-count threshold at 5 (this is where I previously had errors). Below is what I did:

cellbender remove-background \
--fpr 0.01 \
--expected-cells 30000 \
--epochs 150 \
--cuda \
--total-droplets-included 60000 \
--input raw_feature_bc_matrix.h5 \
--z-dim 100 \
--z-layers 500 \
--empty-drop-training-fraction 0.3 \
--low-count-threshold 5 \
--output cellbender_output.h5

Below is the QC of the output. This looks a lot more like what I expect the results to look like, and I'm relatively happy with it. It looks like decreasing the z-dim and z-layers provided more stability. Curious to get your thoughts as well. Thanks!

[images: CellBender QC output plots]

@learning-MD
Author

learning-MD commented Oct 9, 2022

Well, I'm not sure again. Using --z-dim 100, --z-layers 500, and --low-count-threshold 5, two other hashed samples produced outputs similar to those in my original question:

[images: CellBender QC output plots]

I'm not sure what to make of this. Any other suggestions would be appreciated. Thanks.

@sjfleming
Member

Hi @learning-MD , well it looked like the first plot you sent was pretty encouraging. I'd say that run looks just fine.

But yeah, that last plot looks like cellbender was unable to locate the empty droplets appropriately. The learning curve actually looks fine to me, I think that's okay. But it'd be nice if it got the empty droplets right.

The only thing I see in that last run that you might be able to try is to use more --total-droplets-included. Can you maybe go up to 70000?

I need to make a few changes to cellbender so that tweaks like this become unnecessary, but currently, I sometimes see the effect you're seeing (all droplets are being called cells) get corrected if more droplets are included in the analysis.
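
Concretely, that would just mean rerunning your last command with this one value increased, e.g. (abridged here; keep your other flags as they were):

cellbender remove-background \
--input raw_feature_bc_matrix.h5 \
--output cellbender_output.h5 \
--cuda \
--expected-cells 30000 \
--total-droplets-included 70000 \
--low-count-threshold 5 \
--fpr 0.01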

@learning-MD
Author

@sjfleming Thanks! I extended the total droplets to 80000 instead:

[images: CellBender QC output plots]

It seems like extending the total droplets may be the way to proceed for the hashed samples where CellBender calls everything a cell. The learning curve looks okay, but the test curve seems to dip slightly. Not sure what to make of it, but I think this is a successful run? Thanks.

@sjfleming
Member

Okay excellent! Yes that definitely counts as a successful run. In my experience sometimes the test ELBO can meander around a little bit like that, and it's nothing to worry about.

Great!

@sjfleming
Member

Some aspects of training have been tweaked in v0.3.0 in such a way as to make the learning curve more reliable and hopefully mitigate these kinds of issues in the future.

Closed by #238
