Why are the results of AUCell not centered around 0.5 (for a random "regulon")? #440
Replies: 1 comment
-
Enrichment values in AUCell are calculated from the ranking of genes within each individual cell, so an AUC score should be interpreted in the context of that cell’s own gene ranking. In other words, AUCell measures how enriched a regulon’s genes are relative to the other genes in the same cell; the score is cell-specific rather than calibrated against the other cells in the dataset.

If you were expecting AUC values to be centered around 0.5 (as you might see with a traditional ROC curve for binary classification), note that AUCell’s AUC instead reflects where a regulon’s genes fall in a cell’s ranking. A higher AUC indicates that the regulon’s genes rank higher than the other genes in that specific cell, while an AUC closer to 0 indicates that they rank lower.

The scores are therefore not globally centered around 0.5, as they might be in a typical classification task. This lets AUCell capture cell-specific gene activity patterns rather than assuming a uniform, global scale across all cells.
-
Hi there,
I'm trying to better understand how to interpret the output of AUCell. In particular, if we were to calculate the ranking of genes in a regulon, and plot the ranking on the x axis and, on the y axis, the cumulative number of genes in the regulon at or above the corresponding ranking, we would have a typical AUC/ROC curve. A "regulon" consisting of randomly selected genes would have an AUC of 0.5, the maximally enriched regulon would have an AUC of 1, and the maximally suppressed regulon would have an AUC bounded below by 0. This is equivalent (ignoring the case of ties) to the common language effect size between the rankings of the genes in the regulon and the genes not in the regulon, on a cell-by-cell basis.
I suspect that this is due to the parameters `--rank_threshold` and `--auc_threshold`, but I am not sure. In practice, the output of AUCell seems to mostly be between 0 and 0.5, including for genes that seem to be "enriched." And, from what I can tell, downstream analysis seems to suggest using the AUC more as a summary statistic (for example, by using mixture models to binarize the distribution of the AUC).
It seems that this doesn't agree with the more conventional definition of the AUC/ROC (here). Why was this chosen? And, given that, how do we interpret the outputs of pyscenic aucell?
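To make the comparison above concrete, here is a minimal sketch of a recovery-curve AUC for a random gene set, computed once over the whole ranking and once over only the top 5% of it. It assumes, as the name `--auc_threshold` suggests, that the threshold restricts the area to the top fraction of each cell's ranking. The functions `recovery_auc` and `simulate_random_sets` are hypothetical helpers written for illustration, not part of pySCENIC or ctxcore, and the normalisation is a simplification.

```python
import numpy as np


def recovery_auc(ranks, n_genes, top_fraction=1.0):
    """Normalised area under the recovery curve for one cell (illustrative only).

    ranks: 0-based ranks of the gene-set members (0 = top of the ranking).
    top_fraction: fraction of the ranking that the area is computed over.
    """
    cutoff = int(round(top_fraction * n_genes))
    ranks = np.asarray(ranks)
    hits = ranks[ranks < cutoff]  # members that fall inside the window
    # A member at rank r counts as recovered at positions r, ..., cutoff - 1,
    # so it contributes (cutoff - r) unit columns to the area.
    area = np.sum(cutoff - hits)
    # Simplified upper bound: every member sitting at the very top.
    max_area = cutoff * len(ranks)
    return area / max_area


def simulate_random_sets(n_genes=20000, set_size=50, n_cells=1000, seed=0):
    """Mean AUC of random gene sets, full ranking vs. top 5% only."""
    rng = np.random.default_rng(seed)
    full, truncated = [], []
    for _ in range(n_cells):
        ranks = rng.choice(n_genes, size=set_size, replace=False)
        full.append(recovery_auc(ranks, n_genes, top_fraction=1.0))
        truncated.append(recovery_auc(ranks, n_genes, top_fraction=0.05))
    return float(np.mean(full)), float(np.mean(truncated))


full_mean, truncated_mean = simulate_random_sets()
print(f"full ranking: {full_mean:.3f}, top 5% only: {truncated_mean:.3f}")
```

Under these assumptions a random set averages close to 0.5 when the whole ranking is used, and close to half the chosen fraction (about 0.025 for 5%) when the area is restricted to the top of the ranking, which would be consistent with scores that mostly sit well below 0.5.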
EDIT:
As an added note, the following code from ctxcore leads to the perverse situation where setting `auc_threshold` to 1 will lead to an assertion error: if `auc_threshold` were 1, `rank_cutoff` will be equal to `total_genes`, but `rank_threshold` has been set to `total_genes - 1`. So `rank_threshold` can at most be `total_genes - 1`, but then in the final line, `rank_cutoff` is decremented by `1`. It seems that there was a double attempt to fix the 0- vs 1-indexing discrepancy between Python and R, leading to an off-by-one error...
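The behaviour described in this note can be reproduced with a small sketch. This is a reconstruction from the description above only, not the actual ctxcore source: the function name `derive_rank_cutoff_sketch`, the rounding step, and the exact form of the check are assumptions made for illustration.

```python
from typing import Optional


def derive_rank_cutoff_sketch(
    auc_threshold: float,
    total_genes: int,
    rank_threshold: Optional[int] = None,
) -> int:
    """Hypothetical reconstruction of the logic described in the post."""
    if rank_threshold is None:
        # First adjustment: the largest 0-based rank is total_genes - 1.
        rank_threshold = total_genes - 1

    # Assumed rounding; with auc_threshold == 1 this equals total_genes.
    rank_cutoff = int(round(auc_threshold * total_genes))

    # The check compares the not-yet-decremented cutoff with the
    # already-decremented threshold, so auc_threshold == 1 fails here.
    if not 0 < rank_cutoff <= rank_threshold:
        raise AssertionError(
            f"rank_cutoff={rank_cutoff} exceeds rank_threshold={rank_threshold}"
        )

    # Second adjustment: the cutoff is decremented again after the check.
    return rank_cutoff - 1


print(derive_rank_cutoff_sketch(0.05, total_genes=100))

try:
    derive_rank_cutoff_sketch(1.0, total_genes=100)
except AssertionError as err:
    print("AssertionError:", err)
```

In this sketch, applying the decrement before the comparison (or comparing against `total_genes` instead) would let `auc_threshold = 1` through; whether that matches the intended behaviour is a question for the maintainers.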