Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to check whether cellbender might have overcorrected the count matrix? #173

Closed
Jiayi-Zheng opened this issue Dec 15, 2022 · 2 comments
Closed
Assignees
Labels
user question User question about a specific dataset

Comments

@Jiayi-Zheng
Copy link

Jiayi-Zheng commented Dec 15, 2022

Hello, there is this sample that I re-did analysis after learning about cellbender. I was checking for some gene family expressions (I'm working on developmental bio so it's mainly growth factors family) on other scRNA-seq datasets when I thought difference might come from cellbender processing and came to check.
This is the result (I'm using Seurat to do heatmaps, it's a human fetal sample) when I tried to grep all FOX family genes from dataset, the graph above is unprocessed, the graph below is after cellbender processing, I did not do other correction to count matrix aside from general normalization and cell cycle regression by seurat tutorial.
image
image

NOTCH family (above pre-process, below processed):
image
image
NKX family (NKX2-1 is a key marker and something we expected for expression):
image
image

The report from cellbender looked pretty normal. The training progress had a downward curve but went back up smoothly at the later stage.
GW15_CellBender.pdf

My question arises from that I was doing analysis on some human stem cell culture (cellbender processed), and was comparing the sample expression profile to some human fetal dataset I analyzed previously (without cellbender), and saw this pattern (as above). Initially I thought it might be problem with in vivo and in vitro growth environment, but then realized I've actually got some analyzed dataset for comparison and realized the issue).

Wonder if other people have seen similar things?

The command line I used for cellbender is the one as example, the cell estimation number is modified according to estimation from cellranger report. The empty droplet range was also set by the curve and instruction accordingly.

Going back to my topic of discussion. One way of checking for ambient RNA is, of course, with the knowledge of certain RNAs should or should not be expressed in certain cells. However, when dealing with cells that we are unsure about its content, is there some way to check for the possibility of overcorrection?

Thank you so much!

@sjfleming
Copy link
Member

Hi @Jiayi-Zheng , this is a great question.

Two things:

  1. We are in the process of building a few more guarantees into version 0.3.0 (not released yet), which will explicitly guarantee that each gene has no more counts removed than what the noise model can really account for. In version 0.2.0, the solution is certainly guided in that direction as much as we can, but there is no explicit mathematical guarantee, which there will be in version 0.3.0
  2. That being said, version 0.2.0 or 0.2.1 should still be able to do a good job here. Validation is the hard part! But I might have a few suggestions...

It looks like the FOX family genes are a good example. Seurat's heatmap plot looks like it shows a lot more variability in the raw plot than in the cellbender output plot. But I have a few questions:

  1. Have you plotted the z-scored gene expression there? Is that why the scale goes from -2 to +2 ? It might be more useful for the purposes of a cellbender comparison to plot the raw counts themselves. Because you're interested in transcription factors, and transcription factors are typically super lowly expressed, we can get a better picture of what's going on if we plot raw counts (integers).
  2. Once you make a plot of the integer counts before and after, you can get an idea of how many counts are being removed per cell. You can also try to compute the average counts per cell (maybe broken down by cell type) before and after cellbender. Then it would be useful to also compute the average counts in empty droplets (making your best approximation). If the difference in raw versus cellbender is roughly similar to the average counts in empty droplets, then you can have a lot more confidence that the cellbender result is justified.

I suspect that a lot of the reason that the plots for cellbender outputs for the NOTCH family and the NKX family look strange is that the counts are so low to begin with in the raw data, that you're in the regime of 0, 1, 2, or 3 counts per droplet. That's what would account for the discrete nature of the colors in the plot for the raw data.

Removal of background noise from entries in the count matrix that start out as 0, 1, 2, or 3 in the raw data is a very hard problem... if you do find that cellbender seems to be overcorrecting on these low-count genes you're interested in, then there is one more possible alternative you could try: you could exclude low-count genes (below some kind of cutoff you impose) from the cellbender analysis, and only use the cellbender output data for genes that are above your cutoff. There is nothing wrong with this in principle, in my mind. It is easier for cellbender to denoise highly-expressed genes. And highly expressed genes also happen to contribute much more to background noise. Lowly expressed genes contribute very little to background noise, and cellbender should not change their counts drastically. Again, versions before 0.3.0 lack an explicit mathematical guarantee of this (though it should still do the best it can)... but we are trying to make it a real guarantee in v0.3.0.

@sjfleming
Copy link
Member

Overcorrection is something we thought very hard about in v0.3.0, and so there more guarantees that we are not overcorrecting (at least, not more than the specified false positive rate --fpr).

Closed by #238

@sjfleming sjfleming self-assigned this Aug 8, 2023
@sjfleming sjfleming added the user question User question about a specific dataset label Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
user question User question about a specific dataset
Projects
None yet
Development

No branches or pull requests

2 participants