Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambient background is not removed even at high FPR #80

Closed
Hrovatin opened this issue Oct 25, 2020 · 6 comments
Closed

Ambient background is not removed even at high FPR #80

Hrovatin opened this issue Oct 25, 2020 · 6 comments
Assignees
Labels
enhancement New feature or improvement
Milestone

Comments

@Hrovatin
Copy link

I tried to remove ambient expression from my pancreatic datasets with CellBender. However, even at high FPR (0.3, see image below) the insulin (Ins1, Ins2) background is not removed from non-beta cell clusters (the top left cluster is probably beta cells, other cells belong to non-ins expressing cell types). Also, for some other genes the background seems to be removed at low FPR already. The run was successful (run QC: cellbender.pdf
).
Do you have any suggestions on how to achieve the desired background removal?
Plot of selected highly ambient genes expression across UMAP at different FPR rates and no correction:
slika

@sjfleming
Copy link
Member

The plot is very interesting! Are those raw counts (colorbar), or is that scaled somehow? And from your PDF it looks like the run did go well.

In these kinds of cases, it can be difficult to say why the things we thought should be removed are not being removed. I don't know pancreatic cell biology, but I gather you're saying that you really only expect Ins1 and Ins2 from beta cells.

The reason for different rates of removal of different genes should ultimately come down to this: they are present at different rates in the ambient background (and this ambient RNA profile must be consistent with the expression in empty droplets, according to our model of the data).

I might suggest doing a diagnostic plot like this:
one dot for each gene
x-axis: inferred contribution of each gene to ambient RNA
y-axis: removal of each gene by remove-background

One way to create this plot would be to use the function load_anndata_from_input_and_output() from #57
something like this:

import matplotlib.pyplot as plt
%matplotlib inline

adata = load_anndata_from_input_and_output(input_file='raw_data.h5', output_file='remove_bkg_out.h5')

plt.semilogx(adata.var['ambient_expression'], adata.var['n_cellranger'] - adata.var['n_cellbender'], 'k.')
plt.xlabel('Presence of gene in ambient RNA\n(inferred by remove-background)')
plt.ylabel('Counts removed by remove-background')
plt.title('Removal per gene')
plt.show()

You can check out where Ins1 and Ins2 fall on this plot. If this plot shows that genes are being removed in proportion to their presence in the background, then the tool is probably doing what we'd expect.

Currently, changing FPR (as you've already done) is the only way we have of removing "more" background RNA...

What does this mean, if you really believe those counts shouldn't be there? There are a few possible explanations we have run into in the past. One possibility is that you're dealing with some kind of technical artifact that's not modeled by remove-background. One example issue we've seen is that mapping errors among some IG- genes can lead to unpredictable sorts of errors that we are not modeling (and therefore not correcting). Beyond that, I'm not sure...

@Hrovatin
Copy link
Author

Hrovatin commented Nov 4, 2020

The counts are normalised with

sc.pp.normalize_total(adata, target_sum=1e6, exclude_highly_expressed=True)
sc.pp.log1p(adata)

@Hrovatin
Copy link
Author

Hrovatin commented Nov 4, 2020

image

# Corrected data
def readCellbenderH5(filename):
    f = h5py.File(filename, 'r')
    mat=f['background_removed']
    cols=['latent_cell_probability','latent_RT_efficiency']
    rows=['ambient_expression']
    obsdict={x:mat[x] for x in cols}
    rowsdict={x:mat[x] for x in rows}
    ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
                                          mat['indices'][:], 
                                          mat['indptr'][:]),
                                        shape=(mat['shape'][1],mat['shape'][0])),
                  # I had only one column of gene names
              var=pd.DataFrame(rowsdict,index=[x.decode('ascii') for x in mat['gene_names']]),
              obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
              uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
    return(ad)

adata_corrected=readCellbenderH5(folder+'cellbender/cellbender_FPR_0.2_filtered.h5')

# Data before correction - using only genes and cells present in corrected data
adata_raw=sc.read_h5ad(folder+raw_file)[adata_corrected.obs_names,adata_corrected.var_names]

# N counts per gene (sum) for raw and corrected data
n_cellranger=adata_raw.X.sum(axis=0)
n_cellbender=adata_corrected.X.sum(axis=0)


# Plot N removed counts vs ambient presence (estimated by cellbender) and colour some top ambient genes
ambient=list(adata_corrected.var['ambient_expression'])
corrected=list(np.array(n_cellranger - n_cellbender).flatten())
plt.scatter(ambient, corrected,c='k',s=1)
gene_groups=[(['Ins1','Ins2'],'r'),(['Gcg','Iapp','Ppy','Pyy'],'g'),(['mt-Co3'],'b')]
for gene_group in gene_groups:
    print('Genes:',gene_group[0],'colour:',gene_group[1])
    for idx in np.argwhere(adata_corrected.var_names.isin(gene_group[0])).flatten():
        plt.scatter([ambient[idx]], [corrected[idx]],c=gene_group[1])
plt.xlabel('Presence of gene in ambient RNA\n(inferred by remove-background)')
plt.ylabel('Counts removed by remove-background')
plt.title('Removal per gene')

@sjfleming
Copy link
Member

Yeah, okay, thanks for plotting that! It definitely looks like the two insulin genes are indeed the highest expressed in the learned ambient RNA profile. So that's in line with your expectation. And remove-background does seem to be removing many counts of those genes... pretty much in proportion to the inferred ambient expression.

My thoughts:

  1. It does appear like the tool is correctly learning the ambient profile, and also removing counts in proportion (roughly) to their presence in ambient.
  2. The only way to get more removal is to crank up FPR, although you've already tried that. But yeah, then only way you can tune that removal with this tool is by changing the FPR.

It seems like the ambient is really dominated by a handful of genes in your dataset.

@sjfleming
Copy link
Member

(I am still working on fine-tuning the way that the expected FPR is computed, and it's possible that an upcoming release will improve the FPR calculation in such a way that cranking up the FPR has more effect for a dataset like this. It might be a few months before this update gets incorporated though.)

@sjfleming sjfleming self-assigned this Nov 11, 2020
@sjfleming sjfleming added this to the v0.2.1 milestone May 3, 2021
@sjfleming sjfleming added the enhancement New feature or improvement label May 3, 2021
@sjfleming sjfleming mentioned this issue Mar 28, 2023
@sjfleming sjfleming mentioned this issue Aug 6, 2023
@sjfleming
Copy link
Member

Closed by #238

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or improvement
Projects
None yet
Development

No branches or pull requests

2 participants