This website contains results, code, and pre-trained models from Deep Probability Estimation by Sheng Liu*, Aakash Kaku*, Weicheng Zhu*, Matan Leibovich*, Sreyas Mohan*, Boyang Yu, Laure Zanna, Narges Razavian, Carlos Fernandez-Granda [* - Equal Contribution].
For more information, please visit our website: https://jackzhu727.github.io/deep-probability-estimation/
Reliable probability estimation is of crucial importance in many real-world applications where there is inherent uncertainty, such as weather forecasting, medical prognosis, or collision avoidance in autonomous vehicles.
Probability-estimation models are trained on observed outcomes (y) (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities (p) of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the important difference that the objective is to estimate the probabilities (p) rather than to predict the specific outcome.
Figure 1: The probability-estimation problem. In probability estimation, we assume that each observed outcome y (e.g. death or survival in cancer patients) in the training set is randomly generated from a latent, unobserved probability p associated with the corresponding data x (e.g. histopathology images). Training (left): only x and y can be used for training, because p is not observed. Inference (right): given new data x, the trained network produces a probability estimate in [0, 1].
Prediction models based on deep learning are typically trained by minimizing the cross entropy between the model output and the training labels. Models trained with this cost function are guaranteed to be well calibrated in the infinite-data regime, as illustrated by Figure 2 (first column). Unfortunately, in practice, prediction models are trained on finite data. In this case, we observe that neural networks eventually overfit and memorize the observed outcomes completely, and the estimated probabilities collapse to 0 or 1 (Figure 2, second column). However, calibration is preserved during the first stages of training (Figure 2, third column). We also provide a theoretical analysis showing that this is a general phenomenon, intrinsic to probability estimation with finite data when the dimension is large (Theorem 4.1 in the paper).
Figure 2: When trained on infinite data (i.e. resampling outcome labels at each epoch according to the ground-truth probabilities), models minimizing cross entropy are well calibrated (first column). The top row shows results for the synthetic Discrete scenario and the bottom row for the Linear scenario (the dashed line indicates perfect calibration). However, when trained on fixed observed outcomes, the model eventually overfits and the probabilities collapse to either 0 or 1 (second column). This is mitigated by early stopping (i.e. selecting the model based on validation cross-entropy loss), which yields relatively good calibration (third column). The proposed Calibrated Probability Estimation (CaPE) method exploits this to further improve model discrimination while ensuring that the output remains well calibrated.
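As a minimal sketch of this early-stopping strategy (the model, data loaders, and hyperparameters below are placeholders for illustration, not the code used in the paper), one can simply keep the checkpoint with the lowest validation cross-entropy:

```python
import copy
import torch
import torch.nn.functional as F

def select_by_validation_ce(model, train_loader, val_loader, optimizer,
                            epochs=100, device="cpu"):
    """Train with cross-entropy and keep the checkpoint with the lowest validation CE."""
    best_val, best_state = float("inf"), None
    model.to(device)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device).float()
            loss = F.binary_cross_entropy_with_logits(model(x).squeeze(-1), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Validation cross-entropy stays low during early learning, before overfitting sets in.
        model.eval()
        with torch.no_grad():
            val_losses = [
                F.binary_cross_entropy_with_logits(model(x.to(device)).squeeze(-1),
                                                   y.to(device).float()).item()
                for x, y in val_loader
            ]
        val_ce = sum(val_losses) / len(val_losses)
        if val_ce < best_val:
            best_val, best_state = val_ce, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```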
Based on the observations in the previous section, we propose to exploit the learning dynamics of cross-entropy minimization through a method that we name Calibrated Probability Estimation (CaPE). CaPE outperforms existing approaches on most metrics on synthetic and real-world data. The pseudo-code for our proposed approach is given in the paper; an illustrative sketch of the training loop follows the list below. CaPE:
- Avoids overfitting of the model.
- Improves calibration and discrimination performance of the model by exploiting early learning (Figure 2 last column).
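As a rough, illustrative sketch of a training loop in this spirit (the binning-based calibration target, warm-up length, and alternation schedule below are our assumptions, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F


def binned_calibration_targets(probs, labels, n_bins=10):
    """Replace each predicted probability with the empirical outcome
    frequency of the bin it falls into (a simple binning-based calibration target)."""
    targets = probs.clone()
    bin_ids = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            targets[mask] = labels[mask].float().mean()
    return targets


def train_cape(model, loader, optimizer, warmup_epochs=20, total_epochs=100,
               cal_every=2, n_bins=10, device="cpu"):
    """Warm up with cross-entropy on the observed outcomes, then interleave
    epochs whose targets are bin-wise empirical outcome frequencies."""
    model.to(device)
    for epoch in range(total_epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device).float()
            logits = model(x).squeeze(-1)
            if epoch < warmup_epochs or epoch % cal_every != 0:
                # Standard cross-entropy on the observed binary outcomes.
                loss = F.binary_cross_entropy_with_logits(logits, y)
            else:
                # Calibration step: pull the outputs toward bin-wise empirical
                # outcome frequencies, which keeps them close to calibrated.
                with torch.no_grad():
                    p_cal = binned_calibration_targets(torch.sigmoid(logits), y, n_bins)
                loss = F.binary_cross_entropy_with_logits(logits, p_cal)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```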
Figure 3: Comparison between the learning curves of cross-entropy (CE) minimization and the proposed Calibrated Probability Estimation (CaPE), smoothed with a 5-epoch moving average. After an early-learning stage where both training and validation losses decrease, CE minimization overfits (first and second graphs), with disastrous consequences for probability estimation (third and fourth graphs). In contrast, CaPE prevents overfitting, continuing to improve the model while maintaining calibration.
To benchmark probability-estimation methods, we build a synthetic dataset based on UTKFace (Zhang et al., 2017b), which contains face images with associated ages. We use the age of each person to assign them a probability of contracting a disease, and then simulate whether the person actually contracts the illness according to that probability.
Figure 4: Examples from the face-based risk-prediction dataset (Linear scenario: the function used to convert age into a probability is linear).
The probability-estimation task is to estimate the assigned probability from the face image using a model that only has access to the images and the binary outcomes during training.
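As a toy illustration of how such a dataset can be simulated (the specific linear age-to-probability mapping below is an assumption, not necessarily the exact function used for the Linear scenario):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ages associated with the face images (drawn at random here for illustration).
ages = rng.integers(1, 101, size=10_000)

# Linear scenario: map age linearly to a disease probability in [0, 1] (assumed mapping).
p_true = ages / 100.0

# Simulate the observed binary outcome; the model only ever sees the image and y.
y = rng.binomial(1, p_true)
```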
Figure 5: Our proposed approach outperforms existing approaches for different simulated scenarios.
Probability estimation uses the same kind of target labels and network outputs as binary classification. However, classification accuracy is not an appropriate metric for evaluating probability-estimation models, because of the inherent uncertainty of the outcomes.
- Metrics when ground-truth probabilities are available: For synthetic datasets, we have access to the ground-truth probabilities and can use them to evaluate performance. A natural metric is the mean squared error (MSE) between the estimated and ground-truth probabilities.
- Metrics when ground-truth probabilities are not available: This is the case for most real-world datasets. Several calibration metrics (e.g. ECE, MCE, KS-error) and outcome-based metrics (e.g. Brier score, AUC) can be used to evaluate the model. But which of these metrics best captures true probability-estimation performance?
To answer this question, we use the synthetic dataset to compare the different metrics against the gold standard, the MSE with respect to the ground-truth probabilities. The Brier score is found to be highly correlated with this MSE, in contrast to the classification metric AUC and the calibration metrics ECE, MCE, and KS-error.
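For concreteness, here is a sketch of how these metrics can be computed from predicted probabilities and outcomes (the equal-width binning used for ECE is one common choice; the function names are ours):

```python
import numpy as np

def mse_p(p_hat, p_true):
    """Gold-standard metric: requires ground-truth probabilities (synthetic data only)."""
    return np.mean((p_hat - p_true) ** 2)

def brier_score(p_hat, y):
    """Computable from observed outcomes alone; found to correlate strongly with the MSE above."""
    return np.mean((p_hat - y) ** 2)

def ece(p_hat, y, n_bins=10):
    """Expected calibration error: per-bin gap between mean prediction and outcome frequency."""
    bin_ids = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            err += mask.mean() * abs(p_hat[mask].mean() - y[mask].mean())
    return err
```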
- Survival of Cancer Patients: Based on hematoxylin and eosin slides of non-small-cell lung cancers from The Cancer Genome Atlas (TCGA), we estimate the 5-year survival probability of cancer patients.
- Weather Forecasting: We use the German Weather Service dataset, which contains quality-controlled rainfall-depth composites from 17 operational Doppler radars. We use 30 minutes of precipitation data to predict whether the mean precipitation over the covered area will increase or decrease one hour after the most recent measurement; three precipitation maps from the past 30 minutes serve as the input.
- Collision Prediction: We use 0.3 seconds of real dashcam video from the YouTubeCrash dataset as input, and predict the probability of a collision in the next 2 seconds.
On all three real-world datasets, CaPE outperforms existing calibration approaches when compared using the Brier score, which (as discussed above) captures probability-estimation performance in the absence of ground-truth probabilities.
Reliability diagrams for real-world data, computed on the test sets for cross-entropy minimization with early stopping, the proposed method (CaPE), and the best baseline for each dataset. Among all methods, CaPE produces the best-calibrated outputs.
Please visit our GitHub page for data, pre-trained models, code, and instructions on how to use the code.