
why so many "outliers" when focusing on a specific cell type? #17

Open
hurleyLi opened this issue Dec 9, 2021 · 8 comments
@hurleyLi

hurleyLi commented Dec 9, 2021

Hi,
It seems that when I run Metacells on a particular cell type only (e.g. myeloid cells only, or T cells only), the program calls out many outliers with a metacell ID of "-1". For example, the figure I attached contains only myeloid cells, and about 25% of the cells are flagged as "outliers". However, on the UMAP the outliers (red dots) don't seem to be that different from the "normal" cells (blue dots).

[Screenshot: UMAP of myeloid cells, with outlier cells in red and non-outlier cells in blue]

When I run Metacells on all the cell compartments together, I don't get many outliers. Both runs use default parameters. So I'm wondering whether you have tested Metacells on specific cell compartments (i.e. with only one broad cell type) vs. on a mixture of cell types? Should I change the parameters if I want to focus on specific cell types?

Thanks!

@orenbenkiki
Collaborator

Interesting edge case, for sure. We did run MC2 on more homogeneous data and didn't see this effect, but we didn't go as far as feeding it just a single uniform cell type population. Thanks for bringing it to our attention.

How many cells did you have in your data set? I suspect that it doesn't amount to more than one pile (~10K cells), so the divide-and-conquer didn't have a chance to kick in. If I'm right, you might get better results by reducing the target_pile_size to around half or a third of the number of cells - it would be interesting to see whether that solves the issue (not that this excuses the behavior when running on a single pile).

@hurleyLi
Author

hurleyLi commented Dec 9, 2021

I actually have ~60k cells. I found that 1/4 of the outliers were from metacells dissolved because of dissolve_min_metacell_cells. I think that's because my mean UMI count per cell is ~15k, so with the default target_metacell_size and dissolve_min_metacell_cells I wind up with metacells containing few cells, which then get dissolved. So I changed the parameters to target_metacell_size = 500000 and dissolve_min_metacell_cells = 10, but I still get 17% outliers (which don't look like true outliers). Here's the log:

set unnamed.var[rare_gene]: 0 true (0%) out of 27065 bools
set unnamed.obs[cells_rare_gene_module]: 66429 int32 elements with all outliers (100%)
set unnamed.obs[rare_cell]: 0 true (0%) out of 66429 bools
set unnamed.uns[pre_directs]: 11
set unnamed.uns[directs]: 10
set unnamed.var[pre_high_total_gene]: 12567 positive (46.43%) out of 27065 int32s
set unnamed.var[high_total_gene]: 13837 positive (51.13%) out of 27065 int32s
set unnamed.var[pre_high_relative_variance_gene]: 7491 positive (27.68%) out of 27065 int32s
set unnamed.var[high_relative_variance_gene]: 7449 positive (27.52%) out of 27065 int32s
set unnamed.var[forbidden_gene]: 246 true (0.9089%) out of 27065 bools
set unnamed.var[pre_feature_gene]: 1606 positive (5.934%) out of 27065 int32s
set unnamed.var[feature_gene]: 1984 positive (7.331%) out of 27065 int32s
set unnamed.var[pre_gene_deviant_votes]: 1884 positive (6.961%) out of 27065 int32s
set unnamed.var[gene_deviant_votes]: 1739 positive (6.425%) out of 27065 int32s
set unnamed.obs[pre_cell_directs]: 66429 int32s with mean 1.537
set unnamed.obs[cell_directs]: 66429 int32s with mean 1.387
set unnamed.obs[pre_pile]: 0 outliers (0%) out of 66429 int32 elements with 11 groups with mean size 6039
set unnamed.obs[pile]: 0 outliers (0%) out of 66429 int32 elements with 10 groups with mean size 6643
set unnamed.obs[pre_candidate]: 0 outliers (0%) out of 66429 int32 elements with 3650 groups with mean size 18.2
set unnamed.obs[candidate]: 1175 outliers (1.769%) out of 66429 int32 elements with 3429 groups with mean size 19.03
set unnamed.obs[pre_cell_deviant_votes]: 0 positive (0%) out of 66429 int32s
set unnamed.obs[cell_deviant_votes]: 6397 positive (9.63%) out of 66429 int32s
set unnamed.obs[pre_dissolved]: 0 true (0%) out of 66429 bools
set unnamed.obs[dissolved]: 3769 true (5.674%) out of 66429 bools
set unnamed.obs[pre_metacell]: 0 outliers (0%) out of 66429 int32 elements with 2925 groups with mean size 22.71
set unnamed.obs[metacell]: 11341 outliers (17.07%) out of 66429 int32 elements with 2547 groups with mean size 21.63
set unnamed.obs[outlier]: 11341 true (17.07%) out of 66429 bools
set metacells.var[forbidden_gene]: 246 true (0.9089%) out of 27065 bools
set metacells.var[pre_feature_gene]: 1606 positive (5.934%) out of 27065 int32s
set metacells.var[feature_gene]: 1984 positive (7.331%) out of 27065 int32s
set metacells.obs[pile]: 2547 int32s
set metacells.obs[candidate]: 2547 int32s

You can see I still got 11341 outliers (3769 due to dissolved status).
Any suggestions on how I could further improve this?
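For intuition on why deep sequencing shrinks metacells below the dissolve threshold, here is a rough back-of-the-envelope. The target_metacell_size and dissolve_min_metacell_cells values below are HYPOTHETICAL placeholders, not the library's actual defaults - check your installed version's documentation:

```python
# Back-of-the-envelope: if target_metacell_size is a UMI budget, deeply
# sequenced cells fill it with fewer cells, and the resulting small
# metacells fall below the dissolve threshold.
mean_umis_per_cell = 15_000
target_metacell_size = 160_000    # hypothetical UMI budget per metacell
dissolve_min_metacell_cells = 12  # hypothetical dissolve threshold

cells_per_metacell = target_metacell_size / mean_umis_per_cell
print(cells_per_metacell)  # ~10.7 cells per metacell
print(cells_per_metacell < dissolve_min_metacell_cells)  # True: dissolved
```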

Thanks!
Hurley

@orenbenkiki
Collaborator

orenbenkiki commented Dec 9, 2021

15K UMIs per cell, wow. In the datasets we've seen this would have been an immediate suspect for being a doublet :-)

The latest version on GitHub has new quality-assurance functions that may help: specifically compute_outliers_matches, which finds the "most similar" metacell for each outlier, and compute_deviant_fold_factors, which computes which genes cause the outlier cells to be considered deviant relative to this most-similar metacell (having an expression level much higher than the metacell's mean). That might help in isolating the offending genes.
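A toy numpy illustration of the idea behind those two functions (this is NOT the library's implementation; the numbers are made up): match an outlier cell to its most correlated metacell, then flag genes whose log2 fold change against that metacell's mean expression is large.

```python
import numpy as np

# Per-gene expression fractions of two metacells and one outlier cell:
metacell_fractions = np.array([
    [0.50, 0.30, 0.20],  # metacell 0
    [0.10, 0.10, 0.80],  # metacell 1
])
outlier_fractions = np.array([0.45, 0.05, 0.50])

# Most similar metacell by Pearson correlation of expression profiles:
corrs = [np.corrcoef(outlier_fractions, m)[0, 1] for m in metacell_fractions]
best = int(np.argmax(corrs))

# Fold factors of the outlier relative to its matched metacell; genes with
# a large absolute log2 fold change are the "deviant" candidates:
fold = np.log2((outlier_fractions + 1e-5) / (metacell_fractions[best] + 1e-5))
deviant_genes = np.flatnonzero(np.abs(fold) >= 2.0)  # >= ~4x difference

print(best, deviant_genes)  # 1 [0]
```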

In general, sometimes there are "just-plain-bad" genes which cause spurious outliers. Ideally the noisy lonely gene filter should have identified them so they would be excluded from the data set in the first place. Perhaps this filter failed in your case - if so, it would be interesting to understand why. The find_noisy_lonely_genes function has some knobs that can be tweaked...
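A toy illustration of the "noisy lonely gene" idea (again, not the library's implementation, and with synthetic data): a gene that varies a lot but correlates with no other gene is likely technical noise and a candidate for exclusion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500
shared = rng.poisson(5.0, size=(n_cells, 1)).astype(float)

# Five genes sharing a biological signal, plus one independent "lonely" gene:
correlated = shared + rng.poisson(1.0, size=(n_cells, 5))
lonely = rng.poisson(5.0, size=(n_cells, 1)).astype(float)
umis = np.hstack([correlated, lonely])

# For each gene, its strongest correlation with any other gene:
corr = np.corrcoef(umis, rowvar=False)
np.fill_diagonal(corr, 0.0)
max_corr = np.abs(corr).max(axis=0)

# The lonely gene correlates with nothing, so it stands out:
print(int(max_corr.argmin()))  # 5: the last (lonely) gene
```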

I'm still puzzled that the algorithm somehow does manage to work when given more heterogeneous data. Is the set of excluded and forbidden genes identical in both cases?

@hurleyLi
Author

Thanks for the suggestions. Following your thoughts, I've done some testing on a single cell type as mentioned above. The following does NOT really make much difference in reducing the number of outliers:

  1. running with or without find_noisy_lonely_genes
  2. specifying forbidden genes based on your tutorial vs. including more forbidden genes based on their correlation with the initial forbidden genes (I tried a bunch of correlation thresholds; none really changed anything)

The only thing that makes a difference is downsampling my count matrix to 3k UMI counts per cell using scanpy:

adata_down = sc.pp.downsample_counts(adata, counts_per_cell = 3000, copy = True)

That brings the outlier rate down to about 4%. I also noticed that if I don't downsample, ~2% of cells are called "-2" in the metacell results, but I don't see any "-2" when I downsample... So I suspect the algorithm somehow doesn't do well when sequencing is deep, with ~15k UMI counts per cell, especially in a homogeneous cell population. Not sure.
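A toy numpy sketch of what downsampling to a fixed UMI count per cell does (the thread uses sc.pp.downsample_counts; this multinomial resampling is only an approximation of the idea, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
cell = rng.poisson(10.0, size=100)  # one deeply sequenced cell, 100 genes
total = int(cell.sum())             # roughly 1000 UMIs

# Resample a fixed budget of UMIs in proportion to the observed counts:
target = 300
downsampled = rng.multinomial(target, cell / total)

print(int(downsampled.sum()))  # 300: every cell ends up with the same depth
```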

Will try out your other suggestions.

Thanks!

@orenbenkiki
Collaborator

The -2 indicates cells that didn't make the "clean" cut. Not having any means all the cells are "clean" (they don't have a high percentage of excluded - say, mitochondrial - genes; aren't doublets; don't have too few UMIs; etc.). I guess since your data is pre-groomed, this makes sense.

Adding forbidden genes shouldn't help. Excluding genes should, but that has to be done very carefully, only for genes that are truly "just noise" and don't reflect relevant biological behavior. If the issue is resolved by excluding such genes (e.g. detected by seeing they cause deviance in a lot of outlier cells), fine...
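A toy sketch of what excluding genes from the UMI matrix means (the gene names here are made up, and the real pipeline manages exclusion through its own masks rather than by slicing the matrix like this):

```python
import numpy as np

genes = np.array(["GENE_A", "NOISY_1", "GENE_B", "NOISY_2"])
umis = np.arange(12).reshape(3, 4)  # 3 cells x 4 genes

# Drop the "just noise" genes entirely, so they can't vote cells deviant:
excluded = np.isin(genes, ["NOISY_1", "NOISY_2"])
clean_umis = umis[:, ~excluded]
clean_genes = genes[~excluded]

print(clean_genes, clean_umis.shape)  # ['GENE_A' 'GENE_B'] (3, 2)
```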

Otherwise, I'd like to get my hands on this data set so I can figure out what's going on using more invasive measures (e.g., looking carefully at the feature genes chosen at each step of the algorithm, looking at the shape of the cell-cell KNN graphs used, etc.) - this requires a level of debugging not easily accessible from the public APIs. Would that be possible?

@orenbenkiki
Collaborator

BTW - in the next release, the code will automatically adjust the target_pile_size (within a range controlled by some knobs), so it should work better. Maybe? If you want to experiment, you can take the latest version from GitHub and try whether that makes any difference.

@orenbenkiki
Collaborator

We are revamping the outliers detection in general in version 0.9, which should also help. Note that version 0.9 is different from version 0.8 (see the project README). It is in final testing phases; currently you can only install it from the head version on GitHub.

@orenbenkiki
Collaborator

Version 0.9 is now published - it uses a different outliers policy, so it would be interesting to see whether this solves the problem.
