ROOD-MRI: Benchmarking the Robustness of deep learning segmentation models to Out-Of-Distribution data in MRI
The examples folder contains example scripts for working with the `DatasetGenerator` class and the `calculate_metrics` function. We will be adding more examples to this folder, so stay tuned!
The basic workflow is:

1. Generate a benchmarking dataset from a pre-existing test set using the `DatasetGenerator` class.
2. Evaluate a trained model on the benchmarking dataset, generating a dataframe or .csv file containing segmentation metrics for each sample.
3. Calculate benchmarking/robustness metrics from the dataframe/.csv generated in step 2 using `calculate_metrics`.
Dependencies for this library (including MONAI and TorchIO) are listed in the requirements.txt file.
Skip the dataset-generation step if you're using a pre-existing benchmarking dataset (see links to existing datasets below).
If you have a dataset directory that looks like this:
```
/home/user/data/
|-- train_data
`-- test_data
    |-- case_01
    |   |-- t1.nii.gz
    |   `-- seg_label.nii.gz
    |-- case_02
    |   |-- t1.nii.gz
    |   `-- seg_label.nii.gz
    .
    .
    .
    `-- case_99
        |-- t1.nii.gz
        `-- seg_label.nii.gz
```
First, glob the files into a structured list:
```python
from pathlib import Path

data_dir = Path('/home/user/data/')

# Glob image/label paths and pair them into a list of dicts
image_paths = [str(path) for path in sorted(data_dir.glob('test_data/*/t1.nii.gz'))]
label_paths = [str(path) for path in sorted(data_dir.glob('test_data/*/seg_label.nii.gz'))]
input_files = [{'image': img, 'label': lbl} for img, lbl in zip(image_paths, label_paths)]
```
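As an optional sanity check (assuming the one-image-one-label layout above), you can verify that images and labels were paired correctly before generating the dataset:

```python
# Optional sanity check: each image should be paired with the label from the same case folder
assert len(image_paths) == len(label_paths), 'image/label count mismatch'
for pair in input_files:
    assert Path(pair['image']).parent == Path(pair['label']).parent, f'mismatched pair: {pair}'
```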
Then, run the `DatasetGenerator` over the input files:
```python
from roodmri.data import DatasetGenerator

out_path = '/home/user/data/benchmarking'  # specify the path to put benchmarking samples

generator = DatasetGenerator(input_files, out_path)
generator.generate_dataset()
generator.save_filename_mappings(Path(out_path) / 'filename_mappings.csv')  # save new filename mappings
```
The folder specified by `out_path` will now be populated with sub-folders named `Affine_1`, `Affine_2`, ..., `RicianNoise_4`, `RicianNoise_5`, ..., containing transformed samples from the test set. In the name `RicianNoise_4`, `RicianNoise` refers to the transform applied and `4` refers to the severity level. The image below illustrates an example of the five default severity levels on a sample T1-weighted image for (a) ghosting, (b) isotropic downsampling, and (c) MRI (Rician) noise:
For more details and examples using different initial directory structures, see the examples/dataset folder.
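Since each sub-folder name encodes both the transform and the severity level, splitting on the final underscore recovers them. A minimal sketch is shown below; the `parse_folder_name` helper is illustrative only, not part of the package:

```python
def parse_folder_name(name: str):
    """Split a benchmarking sub-folder name such as 'RicianNoise_4' into
    the transform name and the integer severity level."""
    transform, severity = name.rsplit('_', 1)
    return transform, int(severity)

print(parse_folder_name('RicianNoise_4'))  # -> ('RicianNoise', 4)
```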
The end result of the evaluation step should be a csv file or dataframe with segmentation results for each benchmarking sample, as well as for the original (clean) test set:
```
Model,Task,Transform,Severity,Subject_ID,DSC,HD95
unet_a,WMHs,Affine,1,Subject_001,0.82,1.41
unet_a,WMHs,Affine,1,Subject_002,0.79,2.34
...
unet_a,WMHs,Clean,0,Subject_001,0.85,1.41
...
unet_f,Ventricles,Affine,1,Subject_001,0.90,1.56
...
```
Since users' own evaluation pipelines may vary significantly (pre-processing, transforms, dataloaders, etc.), we do not provide modules for evaluating models on the benchmarking dataset. Instead, we suggest that users run their own existing pipelines to generate a csv file like the one above. We will be uploading some of our own examples to the examples folder, including code showing how to parse the transform/severity folder names; a rough sketch is given below.
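As a hypothetical sketch of what such an evaluation loop might look like, assuming each severity sub-folder mirrors the original case_XX layout shown earlier (check filename_mappings.csv for the actual file naming in your generated dataset) and that you plug in your own inference and metric code:

```python
from pathlib import Path

import pandas as pd

# Placeholders for your own pipeline -- replace with your model inference,
# label loading, and metric implementations
def segment(image_path): ...
def load_label(label_path): ...
def dice(pred, label): ...
def hd95(pred, label): ...

bench_dir = Path('/home/user/data/benchmarking')
rows = []
for case_dir in sorted(bench_dir.glob('*/*')):  # e.g. .../RicianNoise_4/case_01
    transform, severity = case_dir.parent.name.rsplit('_', 1)  # folder name encodes transform and severity
    pred = segment(case_dir / 't1.nii.gz')                     # your model's prediction for this sample
    label = load_label(case_dir / 'seg_label.nii.gz')
    rows.append({
        'Model': 'unet_a', 'Task': 'WMHs',                     # identifiers for the model/task being evaluated
        'Transform': transform, 'Severity': int(severity),
        'Subject_ID': case_dir.name,
        'DSC': dice(pred, label), 'HD95': hd95(pred, label),
    })
# Remember to also add rows for the original clean test set (Transform='Clean', Severity=0)
pd.DataFrame(rows).to_csv('/home/user/data/model_evaluation_results.csv', index=False)
```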
For more details regarding the requirements for the csv/dataframe, see metric_calculations.py in the examples folder.
After producing a csv/dataframe with segmentation results, you can use the `calculate_metrics` function to generate a suite of benchmarking metrics:
```python
from pathlib import Path

import pandas as pd

from roodmri.metrics import calculate_metrics

data_path = '/home/user/data/model_evaluation_results.csv'  # change to location of csv
save_path = '/home/user/benchmarking/'  # change to desired location of output files

df = pd.read_csv(data_path)

transform_level_metrics, aggregated_metrics = calculate_metrics(
    df=df,
    transform_col='Transform',
    severity_col='Severity',
    metric_cols={'DSC': True, 'HD95': False},
    clean_label='Clean',
    grouping_cols=['Model', 'Task']
)

transform_level_metrics.to_csv(Path(save_path) / 'transform_level_metrics.csv')
aggregated_metrics.to_csv(Path(save_path) / 'aggregated_metrics.csv')
```
The image below demonstrates an example of using benchmarking metrics to compare model architectures. The numbers in the lower-left corners of the top-row subplots and the upper-left corners of the bottom-row subplots correspond to the mean degradation for each model (top row: Dice similarity coefficient; bottom row: modified (95th percentile) Hausdorff distance):
For more documentation, see metric_calculations.py in the examples folder, or calculate.py, which contains the `calculate_metrics` function. For metric formulations and how to use them, check out our paper.
See the list below for download links to existing benchmarking datasets:
- Hippocampus segmentation dataset: https://www.dropbox.com/sh/t0id61jfwdq1dp9/AAAJyQLUP_6RSFjp-UOfa-Lxa?dl=0
If you like this repository, please give it a star!
If you use this package for your research, please cite our paper:
```bibtex
@article{boone2022rood,
  title={ROOD-MRI: Benchmarking the robustness of deep learning segmentation models to out-of-distribution and corrupted data in MRI},
  author={Boone, Lyndon and Biparva, Mahdi and Forooshani, Parisa Mojiri and Ramirez, Joel and Masellis, Mario and Bartha, Robert and Symons, Sean and Strother, Stephen and Black, Sandra E and Heyn, Chris and others},
  journal={arXiv preprint arXiv:2203.06060},
  year={2022}
}
```
This work has been enabled by the flexibility and modularity of the MONAI and TorchIO libraries. If you like our work and aren't familiar with theirs already, go check them out!