diff --git a/R/getPredictionSets.R b/R/getPredictionSets.R index 37473e9..7fef83b 100644 --- a/R/getPredictionSets.R +++ b/R/getPredictionSets.R @@ -9,7 +9,7 @@ #' #' @param x.query query data for which we want to build prediction sets. Could #' be either a SingleCellExperiment object with the estimated probabilities for -#' each cell type in the colData, or a matrix of dimension \code{n x K}, +#' each cell type in the colData, or a named matrix of dimension \code{n x K}, #' where \code{n} is the number of cells and \code{K} is the number of different #' labels. The colnames of the matrix have to correspond to the cell labels. #' @param x.cal calibration data. Could be either a @@ -17,35 +17,89 @@ #' type in the colData, or a named matrix of dimension \code{m x K}, where #' \code{m} is the number of cells and \code{K} is the number of different #' labels. The colnames of the matrix have to correspond to the cell labels. -#' @param y.cal a vector of length n with the labels of the cells in the -#' calibration data +#' @param y.cal a vector of length \code{m} with the true labels of the cells in +#' the calibration data. #' @param onto the considered section of the cell ontology as an igraph object. #' @param alpha a number between 0 and 1 that indicates the allowed miscoverage #' @param lambdas a vector of possible lambda values to be considered. Necessary #' only when \code{follow.ontology=TRUE}. #' @param follow.ontology If \code{TRUE}, then the function returns hierarchical #' prediction sets that follow the cell ontology structure. If \code{FALSE}, it -#' returns classical conformal prediction sets. +#' returns classical conformal prediction sets. See details. #' @param resample.cal Should the calibration dataset be resampled according to #' the estimated relative frequencies of cell types in the query data? #' @param labels labels of different considered cell types. Necessary if -#' \code{onto=NULL}, otherwise they are set to the leaf nodes of the provided -#' graph -#' @param return.sc if \code{TRUE}, the prediction sets are returned as a -#' SingleCellExperiment. If \code{FALSE}, they are returned as list. By default, +#' \code{onto=NULL}, otherwise they are set equal to the leaf nodes of the +#' provided graph. +#' @param return.sc parameter the controls the output. If \code{TRUE}, the +#' function returns a SingleCellExperiment. +#' If \code{FALSE}, the function returns a list. By default, #' it is set to \code{TRUE} when \code{x.query} is a SingleCellExperiment or -#' SpatialExperiment object, to \code{FALSE} when \code{x.query} is a matrix. -#' @param pr.name name of the colData variable that will contain the prediction -#' sets. Default \code{pred.set}. +#' SpatialExperiment object and to \code{FALSE} when \code{x.query} is a matrix. +#' @param pr.name name of the colData variable in the returned +#' SingleCellExperiment object that will contain the prediction +#' sets. The default name is \code{pred.set}. #' @author Daniela Corbetta -#' @return The function \code{getPredictionSets} returns -#' \item{\code{return.sc=TRUE}}{a SingleCellExperiment or SpatialExperiment +#' @return +#' \item{\code{return.sc = TRUE}}{the function \code{getPredictionSets} returns +#' a SingleCellExperiment or SpatialExperiment #' object with the prediction sets in the colData. The name of the variable #' containing the prediction sets is given by the parameter \code{pr.name}} -#' \item{\code{return.sc=FALSE}}{a list of length equal +#' \item{\code{return.sc = FALSE}}{the function \code{getPredictionSets} returns +#' a list of length equal #' to the number of cells in the test data. Each element of the list contains #' the prediction set for that cell.} -#' @references For an introduction to conformal prediction, see of +#' @details +#' \subsection{Split conformal sets}{Conformal inference is a statistical +#' framework that allows to build +#' prediction sets for any probabilistic or machine learning model. Suppose we +#' have a classification task with \eqn{K} classes. We fit a classification +#' model \eqn{\hat{f}} that outputs estimated probabilities for each class: +#' \eqn{\hat{f}(x) \in [0,1]^K}. Split conformal inference requires to reserve a +#' portion of the labelled training data, \eqn{(X_1,Y_1),\dots, (X_n,Y_n)}, to +#' be used as calibration data. Given \eqn{\hat{f}} and the calibration data, +#' the objective of conformal inference is to build, for a new observation +#' \eqn{X_{n+1},} a prediction set \eqn{C(X_{n+1}) \subseteq\{1,\dots,K\}} that +#' satisfies +#' \deqn{P\left(Y_{n+1}\in C(X_{n+1})\right) \geq 1-\alpha} +#' for a user-chosen error rate \eqn{\alpha}. Note that conformal inference is +#' distribution-free and the sets provided have finite-samples validity. +#' The only assumption is that the test data and the calibration data are +#' exchangeable. The algorithm of split conformal inference is the following: +#' \enumerate{ +#' \item For the data in the calibration set, \eqn{(Y_1,X_1),\dots, (Y_n,X_n)} +#' , obtain the \emph{conformal scores}, \eqn{s_i=1-\hat{f}(X_i)_{Y_i}, +#' \;i=1,\dots,n}. These scores will be high when the model is assigning a +#' small probability to the true class, and low otherwise. +#' \item Obtain \eqn{\hat{q}}, the +#' \eqn{\lceil(1-\alpha)(n+1)\rceil/n} empirical quantile of the conformal +#' scores. +#' \item Finally, for a new observation \eqn{X_{n+1}}, construct a prediction +#' set by including all the classes for which the estimated probability is +#' higher than \eqn{1-\hat{q}}: +#' \deqn{C(X_{n+1})=\{y: \hat{f}(X_{n+1})_y\geq 1-\hat{q}\}.} +#' }} +#' \subsection{Hierarchical conformal sets}{ +#' Let \eqn{\hat{y}(x)} be the class with maximum estimated probability. +#' Moreover, given a directed graph let \eqn{\mathcal{P}(v)} and +#' \eqn{\mathcal{A}(v)} be the set on children nodes and ancestor nodes of +#' \eqn{v}, respectively. Finally, for each node \eqn{v} define a score +#' \eqn{g(v,x)} as the sum of the predicted probabilities of the leaf nodes that +#' are children of \eqn{v}. +#' To build the sets we propose the following algorithm: \deqn{\mathcal{P}(v) +#' \cup \{\mathcal{P}(a): a\in\mathcal{A}(\hat{y}(x)): g(a,x)\leq\lambda \},} +#' where \eqn{v:v\in \mathcal{A}(\hat{y}(x)), \;g(v,x)\geq\lambda,\; +#' v=\arg\min_{u:g(u,x)\geq\lambda}g(u,x)}. +#' In words, we start from the predicted class and we go up in the graph until +#' we find an ancestor of \eqn{\hat{y}(x)} that has a score that is at least +#' \eqn{\lambda} and include in the prediction sets all its children. +#' To ensure that the sets are nested, to this subgraph we add all the other +#' ones that contain \eqn{\hat{y}(x)} for which the score is less than +#' \eqn{\lambda}. To choose \eqn{\lambda}, we follow eq. (4) in Anastasios et +#' al. (2022), considering the miscoverage as loss function. In this way, it is +#' still guaranteed that +#' \deqn{P(Y_{n+1}\notin C_\lambda (X_{n+1})) \leq \alpha.}} +#' @references For an introduction to conformal prediction, see #' Angelopoulos, Anastasios N., and Stephen Bates. "A gentle introduction to #' conformal prediction and distribution-free uncertainty quantification." #' arXiv preprint arXiv:2107.07511 (2021). @@ -76,20 +130,21 @@ getPredictionSets <- function(x.query, x.cal, y.cal, onto=NULL, alpha = 0.1, if(is.null(onto) & is.null(labels)){ stop("Please provide cell labels (labels parameter)") } - if(!is.null(return.sc)){ - if(return.sc==TRUE & is.matrix(x.query)){ - stop("If x.query is a matrix output has to be a list - (return.sc=FALSE)") - } - } if(isa(x.query, "SpatialExperiment") | isa(x.query, "SingleCellExperiment") | isa(x.query, "SummarizedExperiment")) sc <- TRUE else if(is.matrix(x.query)) sc <- FALSE - else stop("Please provide as input in x.query a SpatialExperiment, + else + stop("Please provide as input in x.query a SpatialExperiment, SingleCellExperiment or a matrix") + if(!is.null(return.sc)){ + if(return.sc==TRUE & !sc){ + stop("If x.query is a matrix output has to be a list + (return.sc=FALSE)") + } + } # Retrieve labels from the ontology (need to add retrieval from y.cal/data # when follow.ontology=FALSE) @@ -98,6 +153,7 @@ getPredictionSets <- function(x.query, x.cal, y.cal, onto=NULL, alpha = 0.1, K <- length(labels) # If input is not a matrix, retrieve prediction matrix from colData + # might turn this into a helper function if(!is.matrix(x.query)){ n.query <- ncol(x.query) p.query <- matrix(NA, nrow=n.query, ncol=K) diff --git a/man/getPredictionSets.Rd b/man/getPredictionSets.Rd index 05c1195..55ae959 100644 --- a/man/getPredictionSets.Rd +++ b/man/getPredictionSets.Rd @@ -21,7 +21,7 @@ getPredictionSets( \arguments{ \item{x.query}{query data for which we want to build prediction sets. Could be either a SingleCellExperiment object with the estimated probabilities for -each cell type in the colData, or a matrix of dimension \code{n x K}, +each cell type in the colData, or a named matrix of dimension \code{n x K}, where \code{n} is the number of cells and \code{K} is the number of different labels. The colnames of the matrix have to correspond to the cell labels.} @@ -31,8 +31,8 @@ type in the colData, or a named matrix of dimension \code{m x K}, where \code{m} is the number of cells and \code{K} is the number of different labels. The colnames of the matrix have to correspond to the cell labels.} -\item{y.cal}{a vector of length n with the labels of the cells in the -calibration data} +\item{y.cal}{a vector of length \code{m} with the true labels of the cells in +the calibration data.} \item{onto}{the considered section of the cell ontology as an igraph object.} @@ -43,29 +43,32 @@ only when \code{follow.ontology=TRUE}.} \item{follow.ontology}{If \code{TRUE}, then the function returns hierarchical prediction sets that follow the cell ontology structure. If \code{FALSE}, it -returns classical conformal prediction sets.} +returns classical conformal prediction sets. See details.} \item{resample.cal}{Should the calibration dataset be resampled according to the estimated relative frequencies of cell types in the query data?} \item{labels}{labels of different considered cell types. Necessary if -\code{onto=NULL}, otherwise they are set to the leaf nodes of the provided -graph} +\code{onto=NULL}, otherwise they are set equal to the leaf nodes of the +provided graph.} -\item{return.sc}{if \code{TRUE}, the prediction sets are returned as a -SingleCellExperiment. If \code{FALSE}, they are returned as list. By default, +\item{return.sc}{parameter the controls the output. If \code{TRUE}, the +function returns a SingleCellExperiment. +If \code{FALSE}, the function returns a list. By default, it is set to \code{TRUE} when \code{x.query} is a SingleCellExperiment or -SpatialExperiment object, to \code{FALSE} when \code{x.query} is a matrix.} +SpatialExperiment object and to \code{FALSE} when \code{x.query} is a matrix.} -\item{pr.name}{name of the colData variable that will contain the prediction -sets. Default \code{pred.set}.} +\item{pr.name}{name of the colData variable in the returned +SingleCellExperiment object that will contain the prediction +sets. The default name is \code{pred.set}.} } \value{ -The function \code{getPredictionSets} returns -\item{\code{return.sc=TRUE}}{a SingleCellExperiment or SpatialExperiment +\item{\code{return.sc = TRUE}}{the function \code{getPredictionSets} returns +a SingleCellExperiment or SpatialExperiment object with the prediction sets in the colData. The name of the variable containing the prediction sets is given by the parameter \code{pr.name}} -\item{\code{return.sc=FALSE}}{a list of length equal +\item{\code{return.sc = FALSE}}{the function \code{getPredictionSets} returns +a list of length equal to the number of cells in the test data. Each element of the list contains the prediction set for that cell.} } @@ -78,8 +81,59 @@ probabilities for each cell in the calibration and in the test data, respectively. It returns a list with the prediction sets for each cell in the test data. } +\details{ +\subsection{Split conformal sets}{Conformal inference is a statistical +framework that allows to build +prediction sets for any probabilistic or machine learning model. Suppose we +have a classification task with \eqn{K} classes. We fit a classification +model \eqn{\hat{f}} that outputs estimated probabilities for each class: +\eqn{\hat{f}(x) \in [0,1]^K}. Split conformal inference requires to reserve a +portion of the labelled training data, \eqn{(X_1,Y_1),\dots, (X_n,Y_n)}, to +be used as calibration data. Given \eqn{\hat{f}} and the calibration data, +the objective of conformal inference is to build, for a new observation +\eqn{X_{n+1},} a prediction set \eqn{C(X_{n+1}) \subseteq\{1,\dots,K\}} that +satisfies +\deqn{P\left(Y_{n+1}\in C(X_{n+1})\right) \geq 1-\alpha} +for a user-chosen error rate \eqn{\alpha}. Note that conformal inference is +distribution-free and the sets provided have finite-samples validity. +The only assumption is that the test data and the calibration data are +exchangeable. The algorithm of split conformal inference is the following: +\enumerate{ + \item For the data in the calibration set, \eqn{(Y_1,X_1),\dots, (Y_n,X_n)} + , obtain the \emph{conformal scores}, \eqn{s_i=1-\hat{f}(X_i)_{Y_i}, + \;i=1,\dots,n}. These scores will be high when the model is assigning a + small probability to the true class, and low otherwise. + \item Obtain \eqn{\hat{q}}, the + \eqn{\lceil(1-\alpha)(n+1)\rceil/n} empirical quantile of the conformal + scores. + \item Finally, for a new observation \eqn{X_{n+1}}, construct a prediction + set by including all the classes for which the estimated probability is + higher than \eqn{1-\hat{q}}: + \deqn{C(X_{n+1})=\{y: \hat{f}(X_{n+1})_y\geq 1-\hat{q}\}.} +}} +\subsection{Hierarchical conformal sets}{ +Let \eqn{\hat{y}(x)} be the class with maximum estimated probability. +Moreover, given a directed graph let \eqn{\mathcal{P}(v)} and +\eqn{\mathcal{A}(v)} be the set on children nodes and ancestor nodes of +\eqn{v}, respectively. Finally, for each node \eqn{v} define a score +\eqn{g(v,x)} as the sum of the predicted probabilities of the leaf nodes that +are children of \eqn{v}. +To build the sets we propose the following algorithm: \deqn{\mathcal{P}(v) +\cup \{\mathcal{P}(a): a\in\mathcal{A}(\hat{y}(x)): g(a,x)\leq\lambda \},} +where \eqn{v:v\in \mathcal{A}(\hat{y}(x)), \;g(v,x)\geq\lambda,\; +v=\arg\min_{u:g(u,x)\geq\lambda}g(u,x)}. +In words, we start from the predicted class and we go up in the graph until +we find an ancestor of \eqn{\hat{y}(x)} that has a score that is at least +\eqn{\lambda} and include in the prediction sets all its children. +To ensure that the sets are nested, to this subgraph we add all the other +ones that contain \eqn{\hat{y}(x)} for which the score is less than +\eqn{\lambda}. To choose \eqn{\lambda}, we follow eq. (4) in Anastasios et +al. (2022), considering the miscoverage as loss function. In this way, it is +still guaranteed that +\deqn{P(Y_{n+1}\notin C_\lambda (X_{n+1})) \leq \alpha.}} +} \references{ -For an introduction to conformal prediction, see of +For an introduction to conformal prediction, see Angelopoulos, Anastasios N., and Stephen Bates. "A gentle introduction to conformal prediction and distribution-free uncertainty quantification." arXiv preprint arXiv:2107.07511 (2021).