Why are the results of AUCell not centered around 0.5 (for a random "regulon")? #440

scyrusm · 2024-01-22T21:10:43Z

Hi there,

I'm trying to better understand how to interpret the output of AUCell. In particular, if we were to calculate the ranking of genes in a regulon, and plot on the x axis the ranking, and on the y axis the cumulative number of genes in the regulon at or above the corresponding ranking, we would have a typical AUC/ROC curve. A "regulon" consisting of randomly selected genes would have an AUC of 0.5, a maximally enriched regulon would have an AUC of 1, and a maximally suppressed regulon would have an AUC bounded below by 0. This is equivalent (ignoring the case of ties) to the common language effect size between the rankings of the genes in the regulon and the genes not in the regulon, on a cell-by-cell basis.
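To make that expectation concrete, here is a minimal sketch of the conventional AUC computed over the complete ranking of a single cell, written as the fraction of (regulon gene, non-regulon gene) pairs in which the regulon gene ranks higher. The function name `full_ranking_auc` and the simulated sizes are my own assumptions for illustration; this is not how pyscenic computes its scores.

```python
import numpy as np

def full_ranking_auc(ranks_in_set, total_genes):
    """Conventional AUC over the complete ranking of one cell (no ties assumed).

    ranks_in_set: 0-based ranks (0 = highest-ranked gene) of the gene-set members.
    Returns the fraction of (member, non-member) pairs in which the member ranks higher.
    """
    ranks = np.sort(np.asarray(ranks_in_set))
    n_in = ranks.size
    n_out = total_genes - n_in
    # Non-members ranked above the i-th best member: its rank minus the i members above it.
    non_members_above = ranks - np.arange(n_in)
    pairs_won = np.sum(n_out - non_members_above)
    return pairs_won / (n_in * n_out)

rng = np.random.default_rng(0)
total_genes, set_size, n_cells = 2000, 50, 200  # hypothetical sizes

# One random gene set per simulated cell.
random_aucs = [
    full_ranking_auc(rng.choice(total_genes, size=set_size, replace=False), total_genes)
    for _ in range(n_cells)
]
top_auc = full_ranking_auc(np.arange(set_size), total_genes)
bottom_auc = full_ranking_auc(np.arange(total_genes - set_size, total_genes), total_genes)

print(f"random sets: mean AUC = {np.mean(random_aucs):.3f}")
print(f"top-ranked set: {top_auc:.1f}, bottom-ranked set: {bottom_auc:.1f}")
```

Under this definition the random sets average close to 0.5, and the two extreme sets give exactly 1 and 0.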

I suspect that this is due to the parameters --rank_threshold and --auc_threshold, but I am not sure.
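One way to probe that suspicion is a small simulation. The sketch below assumes that the recovery curve is only integrated over the top fraction of the ranking (given by `auc_threshold`) and that the area is divided by a simple upper bound, `rank_cutoff * set_size`. Both the function `truncated_recovery_auc` and that normalization are assumptions of mine, not the actual pyscenic/ctxcore implementation.

```python
import numpy as np

def truncated_recovery_auc(ranks_in_set, total_genes, auc_threshold=0.05):
    """Normalized area under the recovery curve, restricted to the top-ranked genes.

    ranks_in_set: 0-based ranks (0 = highest-ranked gene) of the gene-set members.
    """
    rank_cutoff = int(round(auc_threshold * total_genes))
    ranks = np.asarray(ranks_in_set)
    recovered = ranks[ranks < rank_cutoff]
    # A member recovered at rank r raises the curve by 1 for the remaining
    # (rank_cutoff - r) positions before the cutoff.
    area = np.sum(rank_cutoff - recovered)
    # Assumed normalization: every member recovered at the very top.
    max_area = rank_cutoff * ranks.size
    return area / max_area

rng = np.random.default_rng(0)
total_genes, set_size, n_cells = 2000, 50, 500  # hypothetical sizes

random_scores = [
    truncated_recovery_auc(
        rng.choice(total_genes, size=set_size, replace=False), total_genes
    )
    for _ in range(n_cells)
]
top_score = truncated_recovery_auc(np.arange(set_size), total_genes)

print(f"random sets: mean score = {np.mean(random_scores):.4f}")
print(f"top-ranked set: score = {top_score:.3f}")
```

With a 5% cutoff, a random set is expected to score about half the cutoff fraction (roughly 0.025) rather than 0.5, since only about 5% of its genes fall inside the integrated window. If AUCell behaves along these lines, it would be consistent with scores that mostly sit well below 0.5.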

In practice, the output of AUCell seems to mostly be between 0 and 0.5, including for genes that seem to be "enriched." And, from what I can tell, downstream analysis seems to suggest using the AUC more as a summary statistic (for example, by using mixture models to binarize the distribution of the AUC).

It seems that this doesn't agree with the more conventional definition of the AUC/ROC (here). Why was this chosen? And, given that, how do we interpret the outputs of pyscenic aucell?

EDIT:
As an added note, the following code from ctxcore leads to the perverse situation where setting auc_threshold to 1 will lead to an assertion error: if auc_threshold were 1, rank_cutoff would equal total_genes, but rank_threshold defaults to total_genes - 1 and can be at most total_genes - 1, so the check that rank_cutoff is at most rank_threshold fails. On top of that, in the final line, rank_cutoff is decremented by 1. It seems that there was a double attempt to fix the 0- vs 1-indexing discrepancy between Python and R, leading to an off-by-one error...

def derive_rank_cutoff(
    auc_threshold: float, total_genes: int, rank_threshold: Optional[int] = None
) -> int:
    """
    Get rank cutoff.

    :param auc_threshold: The fraction of the ranked genome to take into account for
        the calculation of the Area Under the recovery Curve.
    :param total_genes: The total number of genes ranked.
    :param rank_threshold: The total number of ranked genes to take into account when
        creating a recovery curve.
    :return Rank cutoff.
    """

    if not rank_threshold:
        rank_threshold = total_genes - 1

    assert (
        0 < rank_threshold < total_genes
    ), f"Rank threshold must be an integer between 1 and {total_genes:d}."
    assert (
        0.0 < auc_threshold <= 1.0
    ), "AUC threshold must be a fraction between 0.0 and 1.0."

    # In the R implementation the cutoff is rounded.
    rank_cutoff = int(round(auc_threshold * total_genes))
    assert 0 < rank_cutoff <= rank_threshold, (
        f"An AUC threshold of {auc_threshold:f} corresponds to {rank_cutoff:d} top "
        f"ranked genes/regions in the database. Please increase the rank threshold "
        "or decrease the AUC threshold."
    )
    # Make sure we have exactly the same AUC values as the R-SCENIC pipeline.
    # In the latter the rank threshold is not included in AUC calculation.
    rank_cutoff -= 1
    return rank_cutoff
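
The failure can be reproduced without installing anything by using a trimmed, standalone copy of the function quoted above. The names `derive_rank_cutoff_copy` and `raises_assertion` are hypothetical helpers for this sketch, the docstring and messages are dropped, and nothing here is imported from ctxcore.

```python
from typing import Optional

def derive_rank_cutoff_copy(
    auc_threshold: float, total_genes: int, rank_threshold: Optional[int] = None
) -> int:
    # Trimmed, standalone copy of the quoted logic (not imported from ctxcore).
    if not rank_threshold:
        rank_threshold = total_genes - 1
    assert 0 < rank_threshold < total_genes
    assert 0.0 < auc_threshold <= 1.0
    rank_cutoff = int(round(auc_threshold * total_genes))
    assert 0 < rank_cutoff <= rank_threshold
    return rank_cutoff - 1

def raises_assertion(auc_threshold: float, total_genes: int) -> bool:
    """Return True if the copied function rejects these arguments."""
    try:
        derive_rank_cutoff_copy(auc_threshold, total_genes)
    except AssertionError:
        return True
    return False

# auc_threshold = 1.0 passes the "<= 1.0" check, but then rank_cutoff (100)
# exceeds the default rank_threshold (99) and the third assertion fails.
print(raises_assertion(1.0, 100))          # True
# Largest cutoff reachable with the default rank_threshold for 100 genes.
print(derive_rank_cutoff_copy(0.99, 100))  # 98
```

So with the default rank_threshold, an auc_threshold of exactly 1 is accepted by its own range check but can never satisfy the later one.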

mapo121 · 2025-01-22T20:27:47Z

Enrichment values in AUCell are calculated based on the relative ranking of genes within each individual cell, meaning that the AUC scores are interpreted within the context of each cell’s own gene ranking. In other words, AUCell calculates how enriched or active a particular regulon is relative to other genes within the same cell. This means that the AUC score is cell-specific: it reflects where the regulon’s genes sit in that cell’s own ranking, not a direct comparison against other cells in the dataset.

If you were expecting AUC values to be centered around 0.5 (as you might see in traditional ROC curves for binary classification), it’s important to understand that in AUCell’s case, AUC scores reflect the ranking of genes in a particular regulon. A higher AUC indicates that the regulon’s genes are more enriched and ranked higher relative to other genes within that specific cell. On the other hand, an AUC closer to 0 suggests the regulon’s genes are ranked lower relative to the rest of the genes in that cell.

Therefore, the AUC score is not globally centered around 0.5, as might be the case in a typical classification task, but rather reflects how each cell ranks the genes within a regulon relative to the rest of the genes in the same cell. This allows AUCell to capture cell-specific gene activity patterns rather than assuming a uniform or global scale across all cells.
