why so many "outliers" when focusing on a specific cell type? #17
Comments
Interesting edge case, for sure. We did run MC2 on more homogeneous data and didn't see this effect, but we didn't go as far as feeding it a single uniform cell type population. Thanks for bringing it to our attention. How many cells did you have in your data set? I suspect that it doesn't amount to more than 1 pile (~10K cells), so the divide-and-conquer didn't have a chance to kick in. If I'm right, you might get better results reducing the
I actually have ~60k cells. I found that 1/4 of the outliers were due to being dissolved due to
You can see I still got 11341 outliers (3769 due to dissolved status). Thanks!
15K UMIs per cell, wow. In the datasets we've seen, this would have been an immediate suspect for being a doublet :-) The latest version on GitHub has new quality-assurance functions that may help, specifically In general, sometimes there are "just-plain-bad" genes which cause spurious outliers. Ideally the noisy lonely gene filter should have identified them so they would be excluded from the data set in the first place. Perhaps this filter failed in your case; if so, it would be interesting to understand why. The I'm still worried that somehow the algorithm does manage to work when given more heterogeneous data. Is the set of excluded and forbidden genes identical in both cases?
Thanks for the suggestion. Following your thoughts, I've done some testing on a single cell type as mentioned above. The following does NOT really make much difference in terms of reducing the number of outliers:
The only thing that makes a difference is downsampling my count matrix to 3k UMIs per cell using
That brings the number of outliers down to about 4%. I also noticed that if I don't downsample, ~2% of cells are assigned "-2" in the metacells results, but I don't see "-2" when I downsample. So I suspect that the algorithm somehow doesn't do well when sequencing is deep (~15k UMIs per cell), especially in a homogeneous cell population, but I'm not sure. I will try out your other suggestions. Thanks!
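For readers who want to reproduce the downsampling experiment described above, here is a minimal sketch using plain NumPy. It is not the function used in this thread (that name is missing from the source); `downsample_cell`, `downsample_matrix`, and the dense cells-by-genes layout are assumptions made for illustration. Each cell with more than `target` UMIs is subsampled without replacement, and shallower cells are left unchanged.

```python
import numpy as np


def downsample_cell(counts, target, rng):
    """Subsample one cell's UMI counts to `target` total, without replacement.

    Hypothetical helper; cells with at most `target` UMIs are returned unchanged.
    """
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    if total <= target:
        return counts.copy()
    # Draw `target` UMIs from the cell's pool; each gene is one "color".
    return rng.multivariate_hypergeometric(counts, target)


def downsample_matrix(matrix, target, seed=0):
    """Apply downsample_cell to every row of a dense cells-by-genes matrix."""
    rng = np.random.default_rng(seed)
    return np.vstack([downsample_cell(row, target, rng) for row in matrix])


if __name__ == "__main__":
    # Toy data: one deep cell (100 UMIs) and one shallow cell (3 UMIs).
    umis = np.array([[10, 0, 5, 85], [1, 2, 0, 0]])
    print(downsample_matrix(umis, target=20))
```

Sampling without replacement keeps every downsampled count at or below the original count, which a simple rescaling would not guarantee.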
The Adding forbidden genes shouldn't help. Excluding genes should, but that should be done very carefully, for genes that are truly "just noise" and not reflecting relevant biological behavior. If the issue is resolved by exclusion of such genes (e.g. detected by seeing they are causes for deviance in a lot of outlier cells), fine... Otherwise, I'd like to get my hands on this data set so I can figure out what's going on using more invasive measures (e.g., looking carefully at the feature genes chosen at each step of the algorithm, looking at the shape of the cell-cell KNN graphs used, etc.) - this requires a level of debugging not easily accessible from the public APIs. Would that be possible?
BTW - in the next release, the code will automatically adjust the target_pile_size (within a range controlled by some knobs) so it should work better. Maybe? If you want to experiment, you can take the latest version from GitHub and see whether that makes any difference.
We are revamping the outliers detection in general in version 0.9, which should also help. Note that version 0.9 is different from version 0.8 (see the project README). It is in final testing phases; currently you can only install it from the head version on GitHub.
Version 0.9 is now published - it uses a different outliers policy, so it would be interesting to see whether this solves the problem.
Hi,
It seems that when I run Metacells focusing on a particular cell type (e.g. myeloid cells only, or T cells only), the program calls out many outliers with a metacell ID of "-1". For example, the figure I attached contains only myeloid cells, and about 25% of the cells are flagged as "outliers". However, on the UMAP, the outliers (red dots) don't seem that different from the "normal" cells (blue dots).
When I run Metacells on all the cell compartments together, I don't get many outliers. All runs use default parameters. So I'm wondering whether you have tested Metacells on specific cell compartments (i.e. with only one broad cell type) vs. on a mixture of cell types. Should I change the parameters if I want to focus on specific cell types?
Thanks!
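The outlier percentages quoted in this thread can be tallied from a per-cell vector of metacell IDs. The sketch below is only an illustration: `tally_assignments` and the input array are hypothetical names, and the only assumption taken from the thread is that negative IDs such as "-1" (outliers) and "-2" mark cells that were not grouped into a metacell.

```python
import numpy as np


def tally_assignments(metacell_ids):
    """Return the fraction of grouped cells and of each negative ID code.

    Hypothetical helper: IDs >= 0 are treated as real metacells, and each
    negative code (e.g. -1 for outliers) is reported separately.
    """
    ids = np.asarray(metacell_ids)
    summary = {"grouped": float(np.mean(ids >= 0))}
    for code in np.unique(ids[ids < 0]):
        summary[int(code)] = float(np.mean(ids == code))
    return summary


if __name__ == "__main__":
    # Toy assignment vector for eight cells.
    example_ids = [0, 0, 1, -1, -1, 2, -2, 3]
    print(tally_assignments(example_ids))
```

Comparing this summary between a single-cell-type run and a mixed run makes it easy to see whether the extra unassigned cells are "-1", "-2", or both.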