Ambient background is not removed even at high FPR #80

Hrovatin · 2020-10-25T08:43:02Z

I tried to remove ambient expression from my pancreatic datasets with CellBender. However, even at high FPR (0.3, see image below) the insulin (Ins1, Ins2) background is not removed from non-beta cell clusters (the top left cluster is probably beta cells, other cells belong to non-ins expressing cell types). Also, for some other genes the background seems to be removed at low FPR already. The run was successful (run QC: cellbender.pdf
).
Do you have any suggestions on how to achieve the desired background removal?
Plot of selected highly ambient genes expression across UMAP at different FPR rates and no correction:

sjfleming · 2020-10-27T23:08:50Z

The plot is very interesting! Are those raw counts (colorbar), or is that scaled somehow? And from your PDF it looks like the run did go well.

In these kinds of cases, it can be difficult to say why the things we thought should be removed are not being removed. I don't know pancreatic cell biology, but I gather you're saying that you really only expect Ins1 and Ins2 from beta cells.

The reason for different rates of removal of different genes should ultimately come down to this: they are present at different rates in the ambient background (and this ambient RNA profile must be consistent with the expression in empty droplets, according to our model of the data).

I might suggest doing a diagnostic plot like this:
one dot for each gene
x-axis: inferred contribution of each gene to ambient RNA
y-axis: removal of each gene by remove-background

One way to create this plot would be to use the function load_anndata_from_input_and_output() from #57
something like this:

import matplotlib.pyplot as plt
%matplotlib inline

adata = load_anndata_from_input_and_output(input_file='raw_data.h5', output_file='remove_bkg_out.h5')

plt.semilogx(adata.var['ambient_expression'], adata.var['n_cellranger'] - adata.var['n_cellbender'], 'k.')
plt.xlabel('Presence of gene in ambient RNA\n(inferred by remove-background)')
plt.ylabel('Counts removed by remove-background')
plt.title('Removal per gene')
plt.show()

You can check out where Ins1 and Ins2 fall on this plot. If this plot shows that genes are being removed in proportion to their presence in the background, then the tool is probably doing what we'd expect.

Currently, changing FPR (as you've already done) is the only way we have of removing "more" background RNA...

What does this mean, if you really believe those counts shouldn't be there? There are a few possible explanations we have run into in the past. One possibility is that you're dealing with some kind of technical artifact that's not modeled by remove-background. One example issue we've seen is that mapping errors among some IG- genes can lead to unpredictable sorts of errors that we are not modeling (and therefore not correcting). Beyond that, I'm not sure...

Hrovatin · 2020-11-04T12:31:55Z

The counts are normalised with

sc.pp.normalize_total(adata, target_sum=1e6, exclude_highly_expressed=True)
sc.pp.log1p(adata)

Hrovatin · 2020-11-04T16:04:51Z

# Corrected data
def readCellbenderH5(filename):
    f = h5py.File(filename, 'r')
    mat=f['background_removed']
    cols=['latent_cell_probability','latent_RT_efficiency']
    rows=['ambient_expression']
    obsdict={x:mat[x] for x in cols}
    rowsdict={x:mat[x] for x in rows}
    ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
                                          mat['indices'][:], 
                                          mat['indptr'][:]),
                                        shape=(mat['shape'][1],mat['shape'][0])),
                  # I had only one column of gene names
              var=pd.DataFrame(rowsdict,index=[x.decode('ascii') for x in mat['gene_names']]),
              obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
              uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
    return(ad)

adata_corrected=readCellbenderH5(folder+'cellbender/cellbender_FPR_0.2_filtered.h5')

# Data before correction - using only genes and cells present in corrected data
adata_raw=sc.read_h5ad(folder+raw_file)[adata_corrected.obs_names,adata_corrected.var_names]

# N counts per gene (sum) for raw and corrected data
n_cellranger=adata_raw.X.sum(axis=0)
n_cellbender=adata_corrected.X.sum(axis=0)


# Plot N removed counts vs ambient presence (estimated by cellbender) and colour some top ambient genes
ambient=list(adata_corrected.var['ambient_expression'])
corrected=list(np.array(n_cellranger - n_cellbender).flatten())
plt.scatter(ambient, corrected,c='k',s=1)
gene_groups=[(['Ins1','Ins2'],'r'),(['Gcg','Iapp','Ppy','Pyy'],'g'),(['mt-Co3'],'b')]
for gene_group in gene_groups:
    print('Genes:',gene_group[0],'colour:',gene_group[1])
    for idx in np.argwhere(adata_corrected.var_names.isin(gene_group[0])).flatten():
        plt.scatter([ambient[idx]], [corrected[idx]],c=gene_group[1])
plt.xlabel('Presence of gene in ambient RNA\n(inferred by remove-background)')
plt.ylabel('Counts removed by remove-background')
plt.title('Removal per gene')

sjfleming · 2020-11-05T16:00:00Z

Yeah, okay, thanks for plotting that! It definitely looks like the two insulin genes are indeed the highest expressed in the learned ambient RNA profile. So that's in line with your expectation. And remove-background does seem to be removing many counts of those genes... pretty much in proportion to the inferred ambient expression.

My thoughts:

It does appear like the tool is correctly learning the ambient profile, and also removing counts in proportion (roughly) to their presence in ambient.
The only way to get more removal is to crank up FPR, although you've already tried that. But yeah, then only way you can tune that removal with this tool is by changing the FPR.

It seems like the ambient is really dominated by a handful of genes in your dataset.

sjfleming · 2020-11-05T16:02:06Z

(I am still working on fine-tuning the way that the expected FPR is computed, and it's possible that an upcoming release will improve the FPR calculation in such a way that cranking up the FPR has more effect for a dataset like this. It might be a few months before this update gets incorporated though.)

sjfleming · 2023-08-08T18:59:52Z

Closed by #238

sjfleming self-assigned this Nov 11, 2020

sjfleming added this to the v0.2.1 milestone May 3, 2021

sjfleming added the enhancement New feature or improvement label May 3, 2021

sjfleming mentioned this issue Mar 28, 2023

v0.3.0 #189

Closed

sjfleming mentioned this issue Aug 6, 2023

v0.3.0 #238

Merged

sjfleming closed this as completed Aug 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambient background is not removed even at high FPR #80

Ambient background is not removed even at high FPR #80

Hrovatin commented Oct 25, 2020

sjfleming commented Oct 27, 2020

Hrovatin commented Nov 4, 2020

Hrovatin commented Nov 4, 2020

sjfleming commented Nov 5, 2020

sjfleming commented Nov 5, 2020

sjfleming commented Aug 8, 2023

Ambient background is not removed even at high FPR #80

Ambient background is not removed even at high FPR #80

Comments

Hrovatin commented Oct 25, 2020

sjfleming commented Oct 27, 2020

Hrovatin commented Nov 4, 2020

Hrovatin commented Nov 4, 2020

sjfleming commented Nov 5, 2020

sjfleming commented Nov 5, 2020

sjfleming commented Aug 8, 2023