
why so many "outliers" when focusing on a specific cell type? #17

Open
hurleyLi opened this issue Dec 9, 2021 · 8 comments
@hurleyLi

hurleyLi commented Dec 9, 2021

Hi,
It seems that when I run Metacells on a particular cell type only (e.g. myeloid cells only, or T cells only), the program calls out many outliers with a metacell ID of "-1". For example, the figure I attached contains only myeloid cells, and about 25% of the cells are flagged as "outliers". However, on the UMAP the outliers (red dots) don't seem to be that different from the "normal" cells (blue dots).

[Screenshot: UMAP of myeloid cells, with outlier cells in red and non-outlier cells in blue]

When I run Metacells on all the cell compartments together, I don't get many outliers. Both runs use default parameters. So I'm wondering whether you have tested Metacells on specific cell compartments (i.e. with only one broad cell type) vs. on a mixture of cell types? Should I change the parameters if I want to focus on specific cell types?

Thanks!

@orenbenkiki
Collaborator

Interesting edge case, for sure. We did run MC2 on more homogeneous data and didn't see this effect, but we didn't go as far as feeding it just a single uniform cell type population. Thanks for bringing it to our attention.

How many cells did you have in your data set? I suspect that it doesn't amount to more than one pile (~10K cells), so the divide-and-conquer didn't have a chance to kick in. If I'm right, you might get better results by reducing the target_pile_size to around half or a third of the number of cells - it would be interesting to see whether that solves the issue (not that this excuses the behavior when running on a single pile).

@hurleyLi
Author

hurleyLi commented Dec 9, 2021

I actually have ~60k cells. I found that 1/4 of the outliers were from metacells dissolved because of dissolve_min_metacell_cells. I think that's because my mean UMI count per cell is ~15k, so with the default target_metacell_size and dissolve_min_metacell_cells I wind up with metacells containing few cells, which then get dissolved. So I changed the parameters to target_metacell_size = 500000 and dissolve_min_metacell_cells = 10, but I still get 17% outliers (which don't look like true outliers). Here's the log:

set unnamed.var[rare_gene]: 0 true (0%) out of 27065 bools
set unnamed.obs[cells_rare_gene_module]: 66429 int32 elements with all outliers (100%)
set unnamed.obs[rare_cell]: 0 true (0%) out of 66429 bools
set unnamed.uns[pre_directs]: 11
set unnamed.uns[directs]: 10
set unnamed.var[pre_high_total_gene]: 12567 positive (46.43%) out of 27065 int32s
set unnamed.var[high_total_gene]: 13837 positive (51.13%) out of 27065 int32s
set unnamed.var[pre_high_relative_variance_gene]: 7491 positive (27.68%) out of 27065 int32s
set unnamed.var[high_relative_variance_gene]: 7449 positive (27.52%) out of 27065 int32s
set unnamed.var[forbidden_gene]: 246 true (0.9089%) out of 27065 bools
set unnamed.var[pre_feature_gene]: 1606 positive (5.934%) out of 27065 int32s
set unnamed.var[feature_gene]: 1984 positive (7.331%) out of 27065 int32s
set unnamed.var[pre_gene_deviant_votes]: 1884 positive (6.961%) out of 27065 int32s
set unnamed.var[gene_deviant_votes]: 1739 positive (6.425%) out of 27065 int32s
set unnamed.obs[pre_cell_directs]: 66429 int32s with mean 1.537
set unnamed.obs[cell_directs]: 66429 int32s with mean 1.387
set unnamed.obs[pre_pile]: 0 outliers (0%) out of 66429 int32 elements with 11 groups with mean size 6039
set unnamed.obs[pile]: 0 outliers (0%) out of 66429 int32 elements with 10 groups with mean size 6643
set unnamed.obs[pre_candidate]: 0 outliers (0%) out of 66429 int32 elements with 3650 groups with mean size 18.2
set unnamed.obs[candidate]: 1175 outliers (1.769%) out of 66429 int32 elements with 3429 groups with mean size 19.03
set unnamed.obs[pre_cell_deviant_votes]: 0 positive (0%) out of 66429 int32s
set unnamed.obs[cell_deviant_votes]: 6397 positive (9.63%) out of 66429 int32s
set unnamed.obs[pre_dissolved]: 0 true (0%) out of 66429 bools
set unnamed.obs[dissolved]: 3769 true (5.674%) out of 66429 bools
set unnamed.obs[pre_metacell]: 0 outliers (0%) out of 66429 int32 elements with 2925 groups with mean size 22.71
set unnamed.obs[metacell]: 11341 outliers (17.07%) out of 66429 int32 elements with 2547 groups with mean size 21.63
set unnamed.obs[outlier]: 11341 true (17.07%) out of 66429 bools
set metacells.var[forbidden_gene]: 246 true (0.9089%) out of 27065 bools
set metacells.var[pre_feature_gene]: 1606 positive (5.934%) out of 27065 int32s
set metacells.var[feature_gene]: 1984 positive (7.331%) out of 27065 int32s
set metacells.obs[pile]: 2547 int32s
set metacells.obs[candidate]: 2547 int32s

You can see I still got 11341 outliers (3769 due to dissolved status).
Any suggestions on how I could further improve this?
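For intuition on why deep sequencing shrinks metacells below the dissolve threshold, here is a rough back-of-the-envelope. The target_metacell_size and dissolve_min_metacell_cells values below are HYPOTHETICAL placeholders, not the library's actual defaults - check your installed version's documentation:

```python
# Back-of-the-envelope: if target_metacell_size is a UMI budget, deeply
# sequenced cells fill it with fewer cells, and the resulting small
# metacells fall below the dissolve threshold.
mean_umis_per_cell = 15_000
target_metacell_size = 160_000    # hypothetical UMI budget per metacell
dissolve_min_metacell_cells = 12  # hypothetical dissolve threshold

cells_per_metacell = target_metacell_size / mean_umis_per_cell
print(cells_per_metacell)  # ~10.7 cells per metacell
print(cells_per_metacell < dissolve_min_metacell_cells)  # True: dissolved
```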

Thanks!
Hurley

@orenbenkiki
Collaborator

orenbenkiki commented Dec 9, 2021

15K UMIs per cell, wow. In the datasets we've seen this would have been an immediate suspect for being a doublet :-)

The latest version on GitHub has new quality-assurance functions that may help: specifically compute_outliers_matches, which finds the "most similar" metacell for each outlier, and compute_deviant_fold_factors, which computes which genes cause the outlier cells to be considered deviant relative to this most-similar metacell (having an expression level much higher than the metacell's mean). That might help in isolating the offending genes.
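A toy numpy illustration of the idea behind those two functions (this is NOT the library's implementation; the numbers are made up): match an outlier cell to its most correlated metacell, then flag genes whose log2 fold change against that metacell's mean expression is large.

```python
import numpy as np

# Per-gene expression fractions of two metacells and one outlier cell:
metacell_fractions = np.array([
    [0.50, 0.30, 0.20],  # metacell 0
    [0.10, 0.10, 0.80],  # metacell 1
])
outlier_fractions = np.array([0.45, 0.05, 0.50])

# Most similar metacell by Pearson correlation of expression profiles:
corrs = [np.corrcoef(outlier_fractions, m)[0, 1] for m in metacell_fractions]
best = int(np.argmax(corrs))

# Fold factors of the outlier relative to its matched metacell; genes with
# a large absolute log2 fold change are the "deviant" candidates:
fold = np.log2((outlier_fractions + 1e-5) / (metacell_fractions[best] + 1e-5))
deviant_genes = np.flatnonzero(np.abs(fold) >= 2.0)  # >= ~4x difference

print(best, deviant_genes)  # 1 [0]
```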

In general, sometimes there are "just-plain-bad" genes which cause spurious outliers. Ideally the noisy lonely gene filter should have identified them so they would be excluded from the data set in the first place. Perhaps this filter failed in your case - if so, it would be interesting to understand why. The find_noisy_lonely_genes function has some knobs that can be tweaked...
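A toy illustration of the "noisy lonely gene" idea (again, not the library's implementation, and with synthetic data): a gene that varies a lot but correlates with no other gene is likely technical noise and a candidate for exclusion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500
shared = rng.poisson(5.0, size=(n_cells, 1)).astype(float)

# Five genes sharing a biological signal, plus one independent "lonely" gene:
correlated = shared + rng.poisson(1.0, size=(n_cells, 5))
lonely = rng.poisson(5.0, size=(n_cells, 1)).astype(float)
umis = np.hstack([correlated, lonely])

# For each gene, its strongest correlation with any other gene:
corr = np.corrcoef(umis, rowvar=False)
np.fill_diagonal(corr, 0.0)
max_corr = np.abs(corr).max(axis=0)

# The lonely gene correlates with nothing, so it stands out:
print(int(max_corr.argmin()))  # 5: the last (lonely) gene
```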

I'm still puzzled that the algorithm somehow does manage to work when given more heterogeneous data. Is the set of excluded and forbidden genes identical in both cases?

@hurleyLi
Author

Thanks for the suggestions. Following your thoughts, I've done some testing on a single cell type as mentioned above. The following does NOT really make much difference in reducing the number of outliers:

  1. running with or without find_noisy_lonely_genes
  2. specifying forbidden genes based on your tutorial vs. including more forbidden genes based on their correlation with the initial forbidden genes (I tried a bunch of correlation thresholds; none really changed anything)

The only thing that makes a difference is downsampling my count matrix to 3k UMI counts per cell using scanpy:

adata_down = sc.pp.downsample_counts(adata, counts_per_cell = 3000, copy = True)

That brings the outlier rate down to about 4%. I also noticed that if I don't downsample, ~2% of cells are called "-2" in the metacell results, but I don't see any "-2" when I downsample... So I suspect the algorithm somehow doesn't do well when sequencing is deep, with ~15k UMI counts per cell, especially in a homogeneous cell population. Not sure.
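A toy numpy sketch of what downsampling to a fixed UMI count per cell does (the thread uses sc.pp.downsample_counts; this multinomial resampling is only an approximation of the idea, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
cell = rng.poisson(10.0, size=100)  # one deeply sequenced cell, 100 genes
total = int(cell.sum())             # roughly 1000 UMIs

# Resample a fixed budget of UMIs in proportion to the observed counts:
target = 300
downsampled = rng.multinomial(target, cell / total)

print(int(downsampled.sum()))  # 300: every cell ends up with the same depth
```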

Will try out your other suggestions.

Thanks!

@orenbenkiki
Collaborator

The -2 indicates cells that didn't make the "clean" cut. Not having any means all the cells are "clean" (they don't have a high percentage of excluded - say, mitochondrial - genes; aren't doublets; don't have too few UMIs; etc.). I guess since your data is pre-groomed, this makes sense.

Adding forbidden genes shouldn't help. Excluding genes should, but that has to be done very carefully, only for genes that are truly "just noise" and don't reflect relevant biological behavior. If the issue is resolved by excluding such genes (e.g. detected by seeing they cause deviance in a lot of outlier cells), fine...
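A toy sketch of what excluding genes from the UMI matrix means (the gene names here are made up, and the real pipeline manages exclusion through its own masks rather than by slicing the matrix like this):

```python
import numpy as np

genes = np.array(["GENE_A", "NOISY_1", "GENE_B", "NOISY_2"])
umis = np.arange(12).reshape(3, 4)  # 3 cells x 4 genes

# Drop the "just noise" genes entirely, so they can't vote cells deviant:
excluded = np.isin(genes, ["NOISY_1", "NOISY_2"])
clean_umis = umis[:, ~excluded]
clean_genes = genes[~excluded]

print(clean_genes, clean_umis.shape)  # ['GENE_A' 'GENE_B'] (3, 2)
```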

Otherwise, I'd like to get my hands on this data set so I can figure out what's going on using more invasive measures (e.g., looking carefully at the feature genes chosen at each step of the algorithm, looking at the shape of the cell-cell KNN graphs used, etc.) - this requires a level of debugging not easily accessible from the public APIs. Would that be possible?

@orenbenkiki
Collaborator

BTW - in the next release, the code will automatically adjust the target_pile_size (within a range controlled by some knobs), so it should work better. Maybe? If you want to experiment, you can take the latest version from GitHub and try whether that makes any difference.

@orenbenkiki
Collaborator

We are revamping the outliers detection in general in version 0.9, which should also help. Note that version 0.9 is different from version 0.8 (see the project README). It is in final testing phases; currently you can only install it from the head version on GitHub.

@orenbenkiki
Collaborator

Version 0.9 is now published - it uses a different outliers policy, so it would be interesting to see whether this solves the problem.
