From b388e8e728a3f539d1665caaa4211c7155cc224a Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 14:02:21 -0500 Subject: [PATCH 01/14] fix: removed repeated documentation on nnUnet page --- docs/nnUNet.md | 117 +------------------------------------------------ 1 file changed, 1 insertion(+), 116 deletions(-) diff --git a/docs/nnUNet.md b/docs/nnUNet.md index 26e01e40..b51fb747 100644 --- a/docs/nnUNet.md +++ b/docs/nnUNet.md @@ -114,119 +114,4 @@ To run inference, run the command: nnUNet_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -t TASK_NAME_OR_ID -m CONFIGURATION ``` -In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools.# Preparing Data for nnUNet - -nnUNet repo can be found at: - -## Processing DICOM Data with Med-ImageTools - -Ensure that you have followed the steps in before proceeding. - -To convert your data from DICOM to NIfTI for training an nnUNet auto-segmentation model, run the following command: - -```sh -autopipeline\ - [INPUT_DIRECTORY] \ - [OUTPUT_DIRECTORY] \ - --modalities CT,RTSTRUCT \ - --nnunet -``` - -Modalities can also be set to `--modalities MR,RTSTRUCT` - -AutoPipeline offers many more options and features for you to customize your outputs: . - -## nnUNet Preprocess and Train - -### One-Step Preprocess and Train - -Med-ImageTools generates a file in your output folder called `nnunet_preprocess_and_train.sh` that combines all the commands needed for preprocessing and training your nnUNet model. Run that shell script to get a fully trained nnUNet model. - -Alternatively, you can go through each step individually as follows below: - -### nnUNet Preprocessing - -Follow the instructions for setting up your paths for nnUNet: - -Med-ImageTools generates the dataset.json that nnUNet requires in the output directory that you specify. - -The generated output directory structure will look something like: - -```sh -OUTPUT_DIRECTORY -├── nnUNet_preprocessed -├── nnUNet_raw_data_base -│ └── nnUNet_raw_data -│ └── Task500_HNSCC -│ ├── nnunet_preprocess_and_train.sh -│ └── ... -└── nnUNet_trained_models -``` - -nnUNet requires that environment variables be set before any commands are executed. To temporarily set them, run the following: - -```sh -export nnUNet_raw_data_base="/OUTPUT_DIRECTORY/nnUNet_raw_data_base" -export nnUNet_preprocessed="/OUTPUT_DIRECTORY/nnUNet_preprocessed" -export RESULTS_FOLDER="/OUTPUT_DIRECTORY/nnUNet_trained_models" -``` - -To permanently set these environment variables, make sure that in your `~/.bashrc` file, these environment variables are set for nnUNet. The `nnUNet_preprocessed` and `nnUNet_trained_models` folders are generated as empty folders for you by Med-ImageTools. `nnUNet_raw_data_base` is populated with the required raw data files. Add this to the file: - -```sh -export nnUNet_raw_data_base="/OUTPUT_DIRECTORY/nnUNet_raw_data_base" -export nnUNet_preprocessed="/OUTPUT_DIRECTORY/nnUNet_preprocessed" -export RESULTS_FOLDER="/OUTPUT_DIRECTORY/nnUNet_trained_models" -``` - -Then, execute the command: - -```sh -source ~/.bashrc -``` - -Too allow nnUNet to preprocess your data for training, run the following command. Set XXX to the ID that you want to preprocess. This is your task ID. For example, for Task500_HNSCC, the task ID is 500. Task IDs must be between 500 and 999, so Med-ImageTools can run 500 instances with the `--nnunet` flag in a single output folder. 
- -```sh -nnUNet_plan_and_preprocess -t XXX --verify_dataset_integrity -``` - -### nnUNet Training - -Once nnUNet has finished preprocessing, you may begin training your nnUNet model. To train your model, run the following command. Learn more about nnUNet's options here: - -```sh -nnUNet_train CONFIGURATION TRAINER_CLASS_NAME TASK_NAME_OR_ID FOLD -``` - -## nnUNet Inference - -For inference data, nnUNet requires data to be in a different output format. To run AutoPipeline for nnUNet inference, run the following command: - -```sh -autopipeline\ - [INPUT_DIRECTORY] \ - [OUTPUT_DIRECTORY] \ - --modalities CT \ - --nnunet_inference \ - --dataset_json_path [DATASET_JSON_PATH] -``` -To execute this command AutoPipeline needs a json file with the image modality definitions. - -Modalities can also be set to `--modalities MR`. - -The directory structue will look like: - -```sh -OUTPUT_DIRECTORY -├── 0_subject1_0000.nii.gz -└── ... -``` - -To run inference, run the command: - -```sh -nnUNet_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -t TASK_NAME_OR_ID -m CONFIGURATION -``` - -In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools. +In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools.# Preparing Data for nnUNet \ No newline at end of file From 90bd7d8b3611b791236009c51ade3729ef2ee750 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 14:02:58 -0500 Subject: [PATCH 02/14] fix: removed repeated documentation on nnunet page --- docs/nnUNet.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nnUNet.md b/docs/nnUNet.md index b51fb747..f9471f77 100644 --- a/docs/nnUNet.md +++ b/docs/nnUNet.md @@ -19,7 +19,7 @@ autopipeline\ Modalities can also be set to `--modalities MR,RTSTRUCT` AutoPipeline offers many more options and features for you to customize your outputs: < ->. + ## nnUNet Preprocess and Train @@ -114,4 +114,4 @@ To run inference, run the command: nnUNet_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -t TASK_NAME_OR_ID -m CONFIGURATION ``` -In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools.# Preparing Data for nnUNet \ No newline at end of file +In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools. 
\ No newline at end of file From 172b5954446a76be60b300227eabcf347771175c Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 14:10:33 -0500 Subject: [PATCH 03/14] feat: updated to generate_dataset_json func from nnunetv2 --- src/imgtools/autopipeline.py | 25 +++-- src/imgtools/utils/nnunet.py | 171 ++++++++++++++++++++--------------- 2 files changed, 114 insertions(+), 82 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index 5d9759fe..cd16bd55 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -608,16 +608,21 @@ def save_data(self): shutil.rmtree(pathlib.Path(self.output_directory, ".temp").as_posix()) - # Save dataset json - if self.is_nnunet: # dataset.json for nnunet and .sh file to run to process it - imagests_path = pathlib.Path(self.output_directory, "imagesTs").as_posix() - images_test_location = imagests_path if os.path.exists(imagests_path) else None - generate_dataset_json(pathlib.Path(self.output_directory, "dataset.json").as_posix(), - pathlib.Path(self.output_directory, "imagesTr").as_posix(), - images_test_location, - tuple(self.nnunet_info["modalities"].keys()), - {v: k for k, v in self.existing_roi_indices.items()}, - os.path.split(self.input_directory)[1]) + if self.is_nnunet: + # Generate the dataset JSON + channel_names_mapping = { # Earlier generated as {"CT": ""0000"} now needed as {"0": "CT"} + self.nnunet_info["modalities"][k].lstrip('0') or '0': k + for k in self.nnunet_info["modalities"].keys() + } + generate_dataset_json( + output_folder=pathlib.Path(self.output_directory).as_posix(), + channel_names=channel_names_mapping, + labels=self.existing_roi_indices, + file_ending='.nii.gz', + num_training_cases=len(self.train) + ) + + # .sh file for training _, child = os.path.split(self.output_directory) shell_path = pathlib.Path(self.output_directory, child.split("_")[1]+".sh").as_posix() if os.path.exists(shell_path): diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index 584a7534..81ae75d0 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -22,82 +22,109 @@ def markdown_report_images(output_folder, modality_count): plt.pie([train_total, test_total], labels=[f"Train - {train_total}", f"Test - {test_total}"]) plt.savefig(pathlib.Path(output_folder, "markdown_images", "nnunet_train_test_pie.png").as_posix()) -# this code is taken from: -# Division of Medical Image Computing, German Cancer Research Center (DKFZ) -# in the nnUNet and batchgenerator repositories - def save_json(obj, file: str, indent: int = 4, sort_keys: bool = True) -> None: with open(file, 'w') as f: json.dump(obj, f, sort_keys=sort_keys, indent=indent) - -def get_identifiers_from_splitted_files(folder: str): - uniques = np.unique([i[:-12] for i in subfiles(folder, suffix='.nii.gz', join=False)]) - return uniques - - -def subfiles(folder: str, join: bool = True, prefix: str = None, suffix: str = None, sort: bool = True) -> List[str]: - if join: - path_fn = os.path.join - else: - def path_fn(x, y): return y - - res = [path_fn(folder, i) for i in os.listdir(folder) if os.path.isfile(os.path.join(folder, i)) - and (prefix is None or i.startswith(prefix)) - and (suffix is None or i.endswith(suffix))] - if sort: - res.sort() - return res - - -def generate_dataset_json(output_file: str, imagesTr_dir: str, imagesTs_dir: str, modalities: Tuple, - labels: dict, dataset_name: str, sort_keys=True, license: str = "hands off!", dataset_description: str = "", - dataset_reference="", 
dataset_release='0.0'): +# Code take from: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/dataset_conversion/generate_dataset_json.py + +def generate_dataset_json(output_folder: str, + channel_names: dict, + labels: dict, + num_training_cases, + file_ending: str, + regions_class_order: Tuple[int, ...] = None, + dataset_name: str = None, reference: str = None, release: str = None, license: str = 'hands off!', + description: str = None, + overwrite_image_reader_writer: str = None, **kwargs): """ - :param output_file: This needs to be the full path to the dataset.json you intend to write, so - output_file='DATASET_PATH/dataset.json' where the folder DATASET_PATH points to is the one with the - imagesTr and labelsTr subfolders - :param imagesTr_dir: path to the imagesTr folder of that dataset - :param imagesTs_dir: path to the imagesTs folder of that dataset. Can be None - :param modalities: tuple of strings with modality names. must be in the same order as the images (first entry - corresponds to _0000.nii.gz, etc). Example: ('T1', 'T2', 'FLAIR'). - :param labels: dict with int->str (key->value) mapping the label IDs to label names. Note that 0 is always - supposed to be background! Example: {0: 'background', 1: 'edema', 2: 'enhancing tumor'} - :param dataset_name: The name of the dataset. Can be anything you want - :param sort_keys: In order to sort or not, the keys in dataset.json - :param license: - :param dataset_description: - :param dataset_reference: website of the dataset, if available - :param dataset_release: - :return: + Generates a dataset.json file in the output folder + + channel_names: + Channel names must map the index to the name of the channel, example: + { + 0: 'T1', + 1: 'CT' + } + Note that the channel names may influence the normalization scheme!! Learn more in the documentation. + + labels: + This will tell nnU-Net what labels to expect. Important: This will also determine whether you use region-based training or not. + Example regular labels: + { + 'background': 0, + 'left atrium': 1, + 'some other label': 2 + } + Example region-based training: + { + 'background': 0, + 'whole tumor': (1, 2, 3), + 'tumor core': (2, 3), + 'enhancing tumor': 3 + } + + Remember that nnU-Net expects consecutive values for labels! nnU-Net also expects 0 to be background! + + num_training_cases: is used to double check all cases are there! + + file_ending: needed for finding the files correctly. IMPORTANT! File endings must match between images and + segmentations! + + dataset_name, reference, release, license, description: self-explanatory and not used by nnU-Net. Just for + completeness and as a reminder that these would be great! 
+ + overwrite_image_reader_writer: If you need a special IO class for your dataset you can derive it from + BaseReaderWriter, place it into nnunet.imageio and reference it here by name + + kwargs: whatever you put here will be placed in the dataset.json as well + """ - train_identifiers = get_identifiers_from_splitted_files(imagesTr_dir) - - if imagesTs_dir is not None: - test_identifiers = get_identifiers_from_splitted_files(imagesTs_dir) - else: - test_identifiers = [] - - json_dict = {} - json_dict['name'] = dataset_name - json_dict['description'] = dataset_description - json_dict['tensorImageSize'] = "4D" - json_dict['reference'] = dataset_reference - json_dict['licence'] = license - json_dict['release'] = dataset_release - json_dict['modality'] = {str(i): modalities[i] for i in range(len(modalities))} - json_dict['labels'] = {str(i): labels[i] for i in labels.keys()} - - json_dict['numTraining'] = len(train_identifiers) - json_dict['numTest'] = len(test_identifiers) - json_dict['training'] = [ - {'image': "./imagesTr/%s.nii.gz" % i, "label": "./labelsTr/%s.nii.gz" % i} for i - in - train_identifiers] - json_dict['test'] = ["./imagesTs/%s.nii.gz" % i for i in test_identifiers] - - if not output_file.endswith("dataset.json"): - print("WARNING: output file name is not dataset.json! This may be intentional or not. You decide. " - "Proceeding anyways...") - save_json(json_dict, os.path.join(output_file), sort_keys=sort_keys) + + has_regions: bool = any([isinstance(i, (tuple, list)) and len(i) > 1 for i in labels.values()]) + if has_regions: + assert regions_class_order is not None, f"You have defined regions but regions_class_order is not set. " \ + f"You need that." + # channel names need strings as keys + keys = list(channel_names.keys()) + for k in keys: + if not isinstance(k, str): + channel_names[str(k)] = channel_names[k] + del channel_names[k] + + # labels need ints as values + for l in labels.keys(): + value = labels[l] + if isinstance(value, (tuple, list)): + value = tuple([int(i) for i in value]) + labels[l] = value + else: + labels[l] = int(labels[l]) + + dataset_json = { + 'channel_names': channel_names, # previously this was called 'modality'. I didn't like this so this is + # channel_names now. Live with it. 
+ 'labels': labels, + 'numTraining': num_training_cases, + 'file_ending': file_ending, + } + + if dataset_name is not None: + dataset_json['name'] = dataset_name + if reference is not None: + dataset_json['reference'] = reference + if release is not None: + dataset_json['release'] = release + if license is not None: + dataset_json['licence'] = license + if description is not None: + dataset_json['description'] = description + if overwrite_image_reader_writer is not None: + dataset_json['overwrite_image_reader_writer'] = overwrite_image_reader_writer + if regions_class_order is not None: + dataset_json['regions_class_order'] = regions_class_order + + dataset_json.update(kwargs) + + save_json(dataset_json, pathlib.Path(output_folder) / 'dataset.json', sort_keys=False) \ No newline at end of file From bf69ab270293a49d8d6373435a03d4b11d194138 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 15:05:34 -0500 Subject: [PATCH 04/14] feat: changed to nnunetv2 dataset naming convention(TaskXXX_NAME - > DatasetXXX_Name) --- src/imgtools/autopipeline.py | 56 ++++++++++++++++++++++++------------ 1 file changed, 37 insertions(+), 19 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index cd16bd55..c57ba7ec 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -148,7 +148,6 @@ def __init__(self, if not nnunet and continue_processing and not os.path.exists(pathlib.Path(output_directory, ".temp").as_posix()): raise FileNotFoundError(f"Cannot continue processing. .temp directory does not exist in {output_directory}. Run without --continue_processing to start from scratch.") - study_name = os.path.split(self.input_directory)[1] if nnunet_inference: roi_yaml_path = "" custom_train_test_split = False @@ -165,25 +164,44 @@ def __init__(self, "nnUNet_raw_data").as_posix() if not os.path.exists(self.output_directory): os.makedirs(self.output_directory) + all_nnunet_folders = glob.glob(pathlib.Path(self.output_directory, "*", " ").as_posix()) - numbers = [int(os.path.split(os.path.split(folder)[0])[1][4:7]) for folder in all_nnunet_folders if os.path.split(os.path.split(folder)[0])[1].startswith("Task")] - if (len(numbers) == 0 and continue_processing) or not continue_processing or not os.path.exists(pathlib.Path(self.output_directory, f"Task{max(numbers)}_{study_name}", ".temp").as_posix()): - available_numbers = list(range(500, 1000)) - for folder in all_nnunet_folders: - folder_name = os.path.split(os.path.split(folder)[0])[1] - if folder_name.startswith("Task") and folder_name[4:7].isnumeric() and int(folder_name[4:7]) in available_numbers: - available_numbers.remove(int(folder_name[4:7])) - if len(available_numbers) == 0: - raise Error("There are not enough task ID's for the nnUNet output. 
Please make sure that there is at least one task ID available between 500 and 999, inclusive") - task_folder_name = f"Task{available_numbers[0]}_{study_name}" - self.output_directory = pathlib.Path(self.output_directory, task_folder_name).as_posix() - self.task_id = available_numbers[0] + + # Extract used dataset IDs from folder names that match the "Dataset###_" format + used_ids = { + int(pathlib.Path(folder).parent.parent.name[7:10]) + for folder in all_nnunet_folders + if pathlib.Path(folder).parent.parent.name.startswith("Dataset") + } + + study_name = pathlib.Path(self.input_directory).name + new_dataset_required = ( + not used_ids # No existing datasets + or not continue_processing # Processing shouldn't continue with existing datasets + or not pathlib.Path(self.output_directory, f"Dataset{max(used_ids):03}_{study_name}", ".temp").exists() # Temp folder missing + ) + + if new_dataset_required: + all_ids = set(range(1, 1000)) + available_ids = sorted(all_ids - used_ids) + if not available_ids: + raise Error( + "There are not enough dataset IDs for the nnUNet output. " + "Please ensure at least one dataset ID is available between 001 and 999, inclusive." + ) + dataset_id = available_ids[0] # Assign the first available dataset ID else: - self.task_id = max(numbers) - task_folder_name = f"Task{self.task_id}_{study_name}" - self.output_directory = pathlib.Path(self.output_directory, task_folder_name).as_posix() - if not os.path.exists(pathlib.Path(self.output_directory, ".temp").as_posix()): - os.makedirs(pathlib.Path(self.output_directory, ".temp").as_posix()) + dataset_id = max(used_ids) # Reuse the highest existing dataset ID + + self.dataset_id = dataset_id + + # Create the dataset folder name and update the output directory path + dataset_folder_name = f"Dataset{self.dataset_id:03}_{study_name}" + self.output_directory = pathlib.Path(self.output_directory, dataset_folder_name).as_posix() + + temp_folder_path = pathlib.Path(self.output_directory, ".temp") + if not temp_folder_path.exists(): + os.makedirs(temp_folder_path.as_posix()) if not dry_run: # Make a directory @@ -633,7 +651,7 @@ def save_data(self): output += f'export nnUNet_raw_data_base="{self.base_output_directory}/nnUNet_raw_data_base"\n' output += f'export nnUNet_preprocessed="{self.base_output_directory}/nnUNet_preprocessed"\n' output += f'export RESULTS_FOLDER="{self.base_output_directory}/nnUNet_trained_models"\n\n' - output += f'nnUNet_plan_and_preprocess -t {self.task_id} --verify_dataset_integrity\n\n' + output += f'nnUNet_plan_and_preprocess -t {self.dataset_id} --verify_dataset_integrity\n\n' output += 'for (( i=0; i<5; i++ ))\n' output += 'do\n' output += f' nnUNet_train 3d_fullres nnUNetTrainerV2 {os.path.split(self.output_directory)[1]} $i --npz\n' From bca5c54ed9d4d815dbe29829a5513e2bf2bdc622 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 15:35:51 -0500 Subject: [PATCH 05/14] feat: updated to nnunet folder convention(https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/setting_up_paths.md) --- src/imgtools/autopipeline.py | 39 +++++++++--------------------------- src/imgtools/utils/nnunet.py | 38 +++++++++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+), 30 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index c57ba7ec..ef65a1b9 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -13,7 +13,7 @@ from imgtools.ops import StructureSetToSegmentation, ImageAutoInput, ImageAutoOutput, Resample from 
imgtools.pipeline import Pipeline -from imgtools.utils.nnunet import generate_dataset_json, markdown_report_images +from imgtools.utils.nnunet import generate_dataset_json, create_train_script, markdown_report_images from imgtools.utils.args import parser from imgtools.logging import logger @@ -155,15 +155,12 @@ def __init__(self, if modalities != "CT" and modalities != "MR": raise ValueError("nnUNet inference can only be run on image files. Please set modalities to 'CT' or 'MR'") if nnunet: - self.base_output_directory = self.output_directory - if not os.path.exists(pathlib.Path(self.output_directory, "nnUNet_preprocessed").as_posix()): - os.makedirs(pathlib.Path(self.output_directory, "nnUNet_preprocessed").as_posix()) - if not os.path.exists(pathlib.Path(self.output_directory, "nnUNet_trained_models").as_posix()): - os.makedirs(pathlib.Path(self.output_directory, "nnUNet_trained_models").as_posix()) - self.output_directory = pathlib.Path(self.output_directory, "nnUNet_raw_data_base", - "nnUNet_raw_data").as_posix() - if not os.path.exists(self.output_directory): - os.makedirs(self.output_directory) + + pathlib.Path(self.output_directory, "nnUNet_results").mkdir(parents=True, exist_ok=True) + pathlib.Path(self.output_directory, "nnUNet_preprocessed").mkdir(parents=True, exist_ok=True) + raw_path = pathlib.Path(self.output_directory, "nnUNet_raw") + raw_path.mkdir(parents=True, exist_ok=True) + self.output_directory = raw_path.as_posix() all_nnunet_folders = glob.glob(pathlib.Path(self.output_directory, "*", " ").as_posix()) @@ -627,7 +624,6 @@ def save_data(self): shutil.rmtree(pathlib.Path(self.output_directory, ".temp").as_posix()) if self.is_nnunet: - # Generate the dataset JSON channel_names_mapping = { # Earlier generated as {"CT": ""0000"} now needed as {"0": "CT"} self.nnunet_info["modalities"][k].lstrip('0') or '0': k for k in self.nnunet_info["modalities"].keys() @@ -639,25 +635,8 @@ def save_data(self): file_ending='.nii.gz', num_training_cases=len(self.train) ) - - # .sh file for training - _, child = os.path.split(self.output_directory) - shell_path = pathlib.Path(self.output_directory, child.split("_")[1]+".sh").as_posix() - if os.path.exists(shell_path): - os.remove(shell_path) - with open(shell_path, "w", newline="\n") as f: - output = "#!/bin/bash\n" - output += "set -e" - output += f'export nnUNet_raw_data_base="{self.base_output_directory}/nnUNet_raw_data_base"\n' - output += f'export nnUNet_preprocessed="{self.base_output_directory}/nnUNet_preprocessed"\n' - output += f'export RESULTS_FOLDER="{self.base_output_directory}/nnUNet_trained_models"\n\n' - output += f'nnUNet_plan_and_preprocess -t {self.dataset_id} --verify_dataset_integrity\n\n' - output += 'for (( i=0; i<5; i++ ))\n' - output += 'do\n' - output += f' nnUNet_train 3d_fullres nnUNetTrainerV2 {os.path.split(self.output_directory)[1]} $i --npz\n' - output += 'done' - f.write(output) - markdown_report_images(self.output_directory, self.total_modality_counter) # images saved to the output directory + create_train_script(self.output_directory, self.dataset_id) + markdown_report_images(self.output_directory, self.total_modality_counter) # Save summary info (factor into different file) markdown_path = pathlib.Path(self.output_directory, "report.md").as_posix() diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index 81ae75d0..6f96ec58 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -27,6 +27,44 @@ def save_json(obj, file: str, indent: int = 4, sort_keys: bool = 
True) -> None: with open(file, 'w') as f: json.dump(obj, f, sort_keys=sort_keys, indent=indent) +def create_train_script(output_directory, dataset_id): + """ + Creates a bash script (`train.sh`) for running nnUNet training, with paths for raw data, + preprocessed data, and trained models. The script ensures environment variables are set and + executes the necessary training commands. + + Parameters: + - output_directory (str): The directory where the output and subdirectories are located. + - dataset_id (int): The ID of the dataset to be processed. + """ + # Define paths using pathlib + output_directory = pathlib.Path(output_directory) + shell_path = output_directory / 'train.sh' + base_dir = output_directory.parent.parent + + if shell_path.exists(): + shell_path.unlink() + + # Define the environment variables and the script commands + script_content = f"""#!/bin/bash +set -e + +export nnUNet_raw="{base_dir}/nnUNet_raw" +export nnUNet_preprocessed="{base_dir}/nnUNet_preprocessed" +export nnUNet_results="{base_dir}/nnUNet_trained_models" + +nnUNet_plan_and_preprocess -t {dataset_id} --verify_dataset_integrity + +for (( i=0; i<5; i++ )) +do + nnUNet_train 3d_fullres nnUNetTrainerV2 {output_directory.name} $i --npz +done +""" + + # Write the script content to the file + with open(shell_path, "w", newline="\n") as f: + f.write(script_content) + # Code take from: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/dataset_conversion/generate_dataset_json.py def generate_dataset_json(output_folder: str, From 908e8e2eaebe7d0f161cc23f2642d602e6e2122c Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Wed, 27 Nov 2024 16:51:08 -0500 Subject: [PATCH 06/14] feat: changed to nnunetv2 file naming(0_RADCURE-0005_0000.nii.gz -> RADCURE-0005_000_0000.nii.gz) --- src/imgtools/autopipeline.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index ef65a1b9..7f506fc5 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -376,7 +376,6 @@ def process_one_subject(self, subject_id): if os.path.exists(pathlib.Path(self.output_directory,".temp",f'temp_{subject_id}.pkl').as_posix()): print(f"{subject_id} already processed") return - print("Processing:", subject_id) read_results = self.input(subject_id) @@ -428,6 +427,9 @@ def process_one_subject(self, subject_id): if hasattr(read_results[i], "metadata") and read_results[i].metadata is not None: metadata.update(read_results[i].metadata) + if self.is_nnunet or self.is_nnunet_inference: + subject_id = f"{subject_id.split('_')[1]}_{subject_id.split('_')[0]:03}" + # modality is MR and the user has selected to have nnunet output if self.is_nnunet: if modality == "MR": # MR images can have various modalities like FLAIR, T1, etc. 
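
As an aside, the renaming this commit introduces can be sketched in isolation. The snippet below is illustrative only — `to_nnunet_case_id` is a hypothetical name, not a function in this codebase — and it uses the explicit `:>03` alignment that later patches in this series settle on (a bare `:03` format spec raises `ValueError` for `str` operands):

```python
# Illustrative sketch of the nnUNet v2 case naming adopted above:
# '0_RADCURE-0005' plus channel '0000' -> 'RADCURE-0005_000_0000.nii.gz'
def to_nnunet_case_id(subject_id: str) -> str:
    """Map '{SUBJECT_NUM}_{SUBJECT_NAME}' to '{SUBJECT_NAME}_{SUBJECT_NUM}', zero-padded to 3 digits."""
    num, name = subject_id.split("_", 1)
    # ':>03' right-aligns with '0' fill; a plain ':03' is invalid for str values.
    return f"{name}_{num:>03}"

assert to_nnunet_case_id("0_RADCURE-0005") == "RADCURE-0005_000"
```
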
From 7d76c05ee95c874bd2a6076de8a271b2b511f376 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Thu, 28 Nov 2024 08:59:36 -0500 Subject: [PATCH 07/14] style: added comments and updated markdown_report_images function for readability --- src/imgtools/autopipeline.py | 9 ++++++-- src/imgtools/utils/nnunet.py | 40 +++++++++++++++++++++--------------- 2 files changed, 31 insertions(+), 18 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index 7f506fc5..a4312a3b 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -427,7 +427,7 @@ def process_one_subject(self, subject_id): if hasattr(read_results[i], "metadata") and read_results[i].metadata is not None: metadata.update(read_results[i].metadata) - if self.is_nnunet or self.is_nnunet_inference: + if self.is_nnunet or self.is_nnunet_inference: # Going from {SUBJECT_NUM}_{SUBJECT_NAME} -> {SUBJECT_NAME}_{SUBJECT_NUM} subject_id = f"{subject_id.split('_')[1]}_{subject_id.split('_')[0]:03}" # modality is MR and the user has selected to have nnunet output @@ -638,7 +638,12 @@ def save_data(self): num_training_cases=len(self.train) ) create_train_script(self.output_directory, self.dataset_id) - markdown_report_images(self.output_directory, self.total_modality_counter) + markdown_report_images( + self.output_directory, + self.total_modality_counter, + len(self.train), + len(self.test) + ) # Save summary info (factor into different file) markdown_path = pathlib.Path(self.output_directory, "report.md").as_posix() diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index 6f96ec58..56a41a82 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -1,26 +1,34 @@ from typing import Tuple, List -import os -import pathlib -import glob -import json -import numpy as np +import os, pathlib, glob, json import matplotlib.pyplot as plt +def markdown_report_images(output_folder, modality_count, train_total, test_total): + output_folder = pathlib.Path(output_folder) + images_folder = output_folder / "markdown_images" -def markdown_report_images(output_folder, modality_count): + images_folder.mkdir(parents=True, exist_ok=True) + + # Bar plot for modality counts modalities = list(modality_count.keys()) modality_totals = list(modality_count.values()) - if not os.path.exists(pathlib.Path(output_folder, "markdown_images").as_posix()): - os.makedirs(pathlib.Path(output_folder, "markdown_images").as_posix()) - plt.figure(1) + plt.figure() plt.bar(modalities, modality_totals) - plt.savefig(pathlib.Path(output_folder, "markdown_images", "nnunet_modality_count.png").as_posix()) - - plt.figure(2) - train_total = len(glob.glob(pathlib.Path(output_folder, "labelsTr", "*.nii.gz").as_posix())) - test_total = len(glob.glob(pathlib.Path(output_folder, "labelsTs", "*.nii.gz").as_posix())) - plt.pie([train_total, test_total], labels=[f"Train - {train_total}", f"Test - {test_total}"]) - plt.savefig(pathlib.Path(output_folder, "markdown_images", "nnunet_train_test_pie.png").as_posix()) + plt.title("Modality Counts") + plt.xlabel("Modalities") + plt.ylabel("Counts") + plt.savefig(images_folder / "nnunet_modality_count.png") + plt.close() + + # Pie chart for train/test distribution + plt.figure() + plt.pie( + [train_total, test_total], + labels=[f"Train - {train_total}", f"Test - {test_total}"], + autopct='%1.1f%%', + ) + plt.title("Train/Test Distribution") + plt.savefig(images_folder / "nnunet_train_test_pie.png") + plt.close() def save_json(obj, file: str, indent: int = 4, 
sort_keys: bool = True) -> None: From e9cf057814b6dae1d22774abc4f614a0eba9d332 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Thu, 28 Nov 2024 09:35:53 -0500 Subject: [PATCH 08/14] fix: using nnunetv2 functions in train.sh and updated string splicing for subject_id --- src/imgtools/autopipeline.py | 8 +++++--- src/imgtools/utils/nnunet.py | 4 ++-- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index a4312a3b..af473160 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -430,6 +430,8 @@ def process_one_subject(self, subject_id): if self.is_nnunet or self.is_nnunet_inference: # Going from {SUBJECT_NUM}_{SUBJECT_NAME} -> {SUBJECT_NAME}_{SUBJECT_NUM} subject_id = f"{subject_id.split('_')[1]}_{subject_id.split('_')[0]:03}" + subject_name = "_".join(subject_id.split("_")[:-1]) # Extracts {SUBJECT_NAME} + # modality is MR and the user has selected to have nnunet output if self.is_nnunet: if modality == "MR": # MR images can have various modalities like FLAIR, T1, etc. @@ -446,7 +448,7 @@ def process_one_subject(self, subject_id): self.total_modality_counter[modality] = 1 else: self.total_modality_counter[modality] += 1 - if "_".join(subject_id.split("_")[1::]) in self.train: + if subject_name in self.train: self.output(subject_id, image, output_stream, nnunet_info=self.nnunet_info) else: self.output(subject_id, image, output_stream, nnunet_info=self.nnunet_info, train_or_test="Ts") @@ -556,7 +558,7 @@ def process_one_subject(self, subject_id): sparse_mask = np.transpose(mask.generate_sparse_mask().mask_array) sparse_mask = sitk.GetImageFromArray(sparse_mask) # convert the nparray to sitk image sparse_mask.CopyInformation(image) - if "_".join(subject_id.split("_")[1::]) in self.train: + if subject_name in self.train: self.output(subject_id, sparse_mask, output_stream, nnunet_info=self.nnunet_info, label_or_image="labels") # rtstruct is label for nnunet else: self.output(subject_id, sparse_mask, output_stream, nnunet_info=self.nnunet_info, label_or_image="labels", train_or_test="Ts") @@ -595,7 +597,7 @@ def process_one_subject(self, subject_id): metadata["Modalities"] = str(list(subject_modalities)) metadata["numRTSTRUCTs"] = num_rtstructs if self.is_nnunet: - metadata["Train or Test"] = "train" if "_".join(subject_id.split("_")[1::]) in self.train else "test" + metadata["Train or Test"] = "train" if subject_name in self.train else "test" with open(pathlib.Path(self.output_directory,".temp",f'{subject_id}.pkl').as_posix(),'wb') as f: # the continue flag depends on this being the last line in this method pickle.dump(metadata,f) return diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index 56a41a82..cf69887b 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -61,11 +61,11 @@ def create_train_script(output_directory, dataset_id): export nnUNet_preprocessed="{base_dir}/nnUNet_preprocessed" export nnUNet_results="{base_dir}/nnUNet_trained_models" -nnUNet_plan_and_preprocess -t {dataset_id} --verify_dataset_integrity +nnUNetv2_plan_and_preprocess -d {dataset_id} --verify_dataset_integrity -c 3d_fullres for (( i=0; i<5; i++ )) do - nnUNet_train 3d_fullres nnUNetTrainerV2 {output_directory.name} $i --npz + nnUNetv2_train {dataset_id} 3d_fullres $i done """ From dfa1ce174cf2d3d4c1825862c80473543c499c4e Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Thu, 28 Nov 2024 12:48:35 -0500 Subject: [PATCH 09/14] docs: updated nnunet page to reflect v2 
 changes
---
 docs/nnUNet.md | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/docs/nnUNet.md b/docs/nnUNet.md
index f9471f77..cebf2493 100644
--- a/docs/nnUNet.md
+++ b/docs/nnUNet.md
@@ -40,28 +40,28 @@ The generated output directory structure will look something like:
 ```sh
 OUTPUT_DIRECTORY
 ├── nnUNet_preprocessed
-├── nnUNet_raw_data_base
-│   └── nnUNet_raw_data
-│       └── Task500_HNSCC
-│           ├── nnunet_preprocess_and_train.sh
-│           └── ...
-└── nnUNet_trained_models
+├── nnUNet_raw
+│   └── Dataset001_HNSCC
+│       ├── nnunet_preprocess_and_train.sh
+│       └── ...
+└── nnUNet_results
+
 ```
 
 nnUNet requires that environment variables be set before any commands are executed. To temporarily set them, run the following:
 
 ```sh
-export nnUNet_raw_data_base="/OUTPUT_DIRECTORY/nnUNet_raw_data_base"
+export nnUNet_raw="/OUTPUT_DIRECTORY/nnUNet_raw"
 export nnUNet_preprocessed="/OUTPUT_DIRECTORY/nnUNet_preprocessed"
-export RESULTS_FOLDER="/OUTPUT_DIRECTORY/nnUNet_trained_models"
+export nnUNet_results=="/OUTPUT_DIRECTORY/nnUNet_results"
 ```
 
-To permanently set these environment variables, make sure that in your `~/.bashrc` file, these environment variables are set for nnUNet. The `nnUNet_preprocessed` and `nnUNet_trained_models` folders are generated as empty folders for you by Med-ImageTools. `nnUNet_raw_data_base` is populated with the required raw data files. Add this to the file:
+To permanently set these environment variables, make sure that in your `~/.bashrc` file, these environment variables are set for nnUNet. The `nnUNet_preprocessed` and `nnUNet_results` folders are generated as empty folders for you by Med-ImageTools. `nnUNet_raw` is populated with the required raw data files. Add this to the file:
 
 ```sh
-export nnUNet_raw_data_base="/OUTPUT_DIRECTORY/nnUNet_raw_data_base"
+export nnUNet_raw="/OUTPUT_DIRECTORY/nnUNet_raw"
 export nnUNet_preprocessed="/OUTPUT_DIRECTORY/nnUNet_preprocessed"
-export RESULTS_FOLDER="/OUTPUT_DIRECTORY/nnUNet_trained_models"
+export nnUNet_results="/OUTPUT_DIRECTORY/nnUNet_results"
 ```
 
 Then, execute the command:
@@ -70,18 +70,18 @@ Then, execute the command:
 source ~/.bashrc
 ```
 
-Too allow nnUNet to preprocess your data for training, run the following command. Set XXX to the ID that you want to preprocess. This is your task ID. For example, for Task500_HNSCC, the task ID is 500. Task IDs must be between 500 and 999, so Med-ImageTools can run 500 instances with the `--nnunet` flag in a single output folder.
+To allow nnUNet to preprocess your data for training, run the following command. Set X to the ID that you want to preprocess. This is your dataset ID. For example, for Dataset001_HNSCC, the dataset ID is 1. Dataset IDs must be between 1 and 999, so Med-ImageTools can run 999 instances with the `--nnunet` flag in a single output folder.
 
 ```sh
-nnUNet_plan_and_preprocess -t XXX --verify_dataset_integrity
+nnUNetv2_plan_and_preprocess -d X --verify_dataset_integrity -c 3d_fullres
 ```
 
 ### nnUNet Training
 
-Once nnUNet has finished preprocessing, you may begin training your nnUNet model. To train your model, run the following command.
Learn more about nnUNet's options here: ```sh -nnUNet_train CONFIGURATION TRAINER_CLASS_NAME TASK_NAME_OR_ID FOLD +nnUNetv2_train nnUNetv2_train DATASET_NAME_OR_ID UNET_CONFIGURATION FOLD ``` ## nnUNet Inference @@ -104,14 +104,14 @@ The directory structue will look like: ```sh OUTPUT_DIRECTORY -├── 0_subject1_0000.nii.gz +├── subject1_000_0000.nii.gz └── ... ``` To run inference, run the command: ```sh -nnUNet_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -t TASK_NAME_OR_ID -m CONFIGURATION +nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -d DATASET_NAME_OR_ID -c CONFIGURATION ``` In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools. \ No newline at end of file From 3ed91bc55cdf345345c3bd08a3255f17cbef5221 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Thu, 28 Nov 2024 16:27:32 -0500 Subject: [PATCH 10/14] resolved pr comments --- docs/nnUNet.md | 4 +- src/imgtools/autopipeline.py | 7 ++- src/imgtools/utils/nnunet.py | 98 ++++++++++++++++++------------------ 3 files changed, 54 insertions(+), 55 deletions(-) diff --git a/docs/nnUNet.md b/docs/nnUNet.md index cebf2493..87b3a305 100644 --- a/docs/nnUNet.md +++ b/docs/nnUNet.md @@ -53,7 +53,7 @@ nnUNet requires that environment variables be set before any commands are execut ```sh export nnUNet_raw="/OUTPUT_DIRECTORY/nnUNet_raw" export nnUNet_preprocessed="/OUTPUT_DIRECTORY/nnUNet_preprocessed" -export nnUNet_results=="/OUTPUT_DIRECTORY/nnUNet_results" +export nnUNet_results="/OUTPUT_DIRECTORY/nnUNet_results" ``` To permanently set these environment variables, make sure that in your `~/.bashrc` file, these environment variables are set for nnUNet. The `nnUNet_preprocessed` and `nnUNet_results` folders are generated as empty folders for you by Med-ImageTools. `nnUNet_raw` is populated with the required raw data files. Add this to the file: @@ -81,7 +81,7 @@ nnUNetv2_plan_and_preprocess -d X --verify_dataset_integrity -c 3d_fullres Once nnUNet has finished preprocessing, you may begin training your nnUNet model. To train your model, run the following command. Learn more about nnUNet's options here: ```sh -nnUNetv2_train nnUNetv2_train DATASET_NAME_OR_ID UNET_CONFIGURATION FOLD +nnUNetv2_train DATASET_NAME_OR_ID UNET_CONFIGURATION FOLD ``` ## nnUNet Inference diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index af473160..ab094e6e 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -182,7 +182,7 @@ def __init__(self, all_ids = set(range(1, 1000)) available_ids = sorted(all_ids - used_ids) if not available_ids: - raise Error( + raise ValueError( "There are not enough dataset IDs for the nnUNet output. " "Please ensure at least one dataset ID is available between 001 and 999, inclusive." ) @@ -197,8 +197,7 @@ def __init__(self, self.output_directory = pathlib.Path(self.output_directory, dataset_folder_name).as_posix() temp_folder_path = pathlib.Path(self.output_directory, ".temp") - if not temp_folder_path.exists(): - os.makedirs(temp_folder_path.as_posix()) + temp_folder_path.mkdir(parents=True, exist_ok=True) if not dry_run: # Make a directory @@ -319,7 +318,7 @@ def __init__(self, raise FileNotFoundError(f"No file named {dataset_json_path} found. 
Image modality definitions are required for nnUNet inference") else: with open(dataset_json_path, "r") as f: - self.nnunet_info["modalities"] = {v: k.zfill(4) for k, v in json.load(f)["modality"].items()} + self.nnunet_info["modalities"] = {v: k.zfill(4) for k, v in json.load(f)["channel_names"].items()} # Input operations self.input = ImageAutoInput(input_directory, modalities, n_jobs, visualize, update) diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index cf69887b..4e373caf 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -1,8 +1,12 @@ -from typing import Tuple, List -import os, pathlib, glob, json +from typing import Tuple, Dict +import pathlib, json import matplotlib.pyplot as plt -def markdown_report_images(output_folder, modality_count, train_total, test_total): +def markdown_report_images( + output_folder: str | pathlib.Path, + modality_count: Dict[str, int], + train_total: int, + test_total: int) -> None: output_folder = pathlib.Path(output_folder) images_folder = output_folder / "markdown_images" @@ -31,13 +35,19 @@ def markdown_report_images(output_folder, modality_count, train_total, test_tota plt.close() -def save_json(obj, file: str, indent: int = 4, sort_keys: bool = True) -> None: +def save_json( + obj: str, + file: str | pathlib.Path, + indent: int = 4, + sort_keys: bool = True) -> None: with open(file, 'w') as f: json.dump(obj, f, sort_keys=sort_keys, indent=indent) -def create_train_script(output_directory, dataset_id): +def create_train_script( + output_directory: str | pathlib.Path, + dataset_id: int): """ - Creates a bash script (`train.sh`) for running nnUNet training, with paths for raw data, + Creates a bash script (`nnunet_preprocess_and_train.sh`) for running nnUNet training, with paths for raw data, preprocessed data, and trained models. The script ensures environment variables are set and executes the necessary training commands. @@ -47,7 +57,7 @@ def create_train_script(output_directory, dataset_id): """ # Define paths using pathlib output_directory = pathlib.Path(output_directory) - shell_path = output_directory / 'train.sh' + shell_path = output_directory / 'nnunet_preprocess_and_train.sh' base_dir = output_directory.parent.parent if shell_path.exists(): @@ -59,7 +69,7 @@ def create_train_script(output_directory, dataset_id): export nnUNet_raw="{base_dir}/nnUNet_raw" export nnUNet_preprocessed="{base_dir}/nnUNet_preprocessed" -export nnUNet_results="{base_dir}/nnUNet_trained_models" +export nnUNet_results="{base_dir}/nnUNet_results" nnUNetv2_plan_and_preprocess -d {dataset_id} --verify_dataset_integrity -c 3d_fullres @@ -70,20 +80,24 @@ def create_train_script(output_directory, dataset_id): """ # Write the script content to the file - with open(shell_path, "w", newline="\n") as f: + with shell_path.open("w", newline="\n") as f: f.write(script_content) # Code take from: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/dataset_conversion/generate_dataset_json.py def generate_dataset_json(output_folder: str, - channel_names: dict, - labels: dict, - num_training_cases, + channel_names: Dict[str, str], + labels: Dict[str, int], + num_training_cases: int, file_ending: str, regions_class_order: Tuple[int, ...] 
= None, - dataset_name: str = None, reference: str = None, release: str = None, license: str = 'hands off!', + dataset_name: str = None, + reference: str = None, + release: str = None, + usage_license: str = 'hands off!', description: str = None, - overwrite_image_reader_writer: str = None, **kwargs): + overwrite_image_reader_writer: str = None, + **kwargs): """ Generates a dataset.json file in the output folder @@ -130,47 +144,33 @@ def generate_dataset_json(output_folder: str, has_regions: bool = any([isinstance(i, (tuple, list)) and len(i) > 1 for i in labels.values()]) if has_regions: - assert regions_class_order is not None, f"You have defined regions but regions_class_order is not set. " \ - f"You need that." - # channel names need strings as keys - keys = list(channel_names.keys()) - for k in keys: - if not isinstance(k, str): - channel_names[str(k)] = channel_names[k] - del channel_names[k] - - # labels need ints as values - for l in labels.keys(): - value = labels[l] - if isinstance(value, (tuple, list)): - value = tuple([int(i) for i in value]) - labels[l] = value - else: - labels[l] = int(labels[l]) + assert regions_class_order is not None, "You have defined regions but regions_class_order is not set. " \ + "You need that." dataset_json = { - 'channel_names': channel_names, # previously this was called 'modality'. I didn't like this so this is - # channel_names now. Live with it. + 'channel_names': channel_names, 'labels': labels, 'numTraining': num_training_cases, 'file_ending': file_ending, } - if dataset_name is not None: - dataset_json['name'] = dataset_name - if reference is not None: - dataset_json['reference'] = reference - if release is not None: - dataset_json['release'] = release - if license is not None: - dataset_json['licence'] = license - if description is not None: - dataset_json['description'] = description - if overwrite_image_reader_writer is not None: - dataset_json['overwrite_image_reader_writer'] = overwrite_image_reader_writer - if regions_class_order is not None: - dataset_json['regions_class_order'] = regions_class_order - - dataset_json.update(kwargs) + # Construct the dataset JSON structure + dataset_json = { + "channel_names": channel_names, + "labels": labels, + "numTraining": num_training_cases, + "file_ending": file_ending, + "name": dataset_name, + "reference": reference, + "release": release, + "licence": usage_license, + "description": description, + "overwrite_image_reader_writer": overwrite_image_reader_writer, + "regions_class_order": regions_class_order, + } + + dataset_json = {k: v for k, v in dataset_json.items() if v is not None} + + dataset_json.update(kwargs) save_json(dataset_json, pathlib.Path(output_folder) / 'dataset.json', sort_keys=False) \ No newline at end of file From 1a0552dec10e3a699651da7f607b44e4273e8ec7 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Thu, 28 Nov 2024 16:34:51 -0500 Subject: [PATCH 11/14] fix: small bugs --- src/imgtools/utils/nnunet.py | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py index 4e373caf..9606df3f 100644 --- a/src/imgtools/utils/nnunet.py +++ b/src/imgtools/utils/nnunet.py @@ -36,7 +36,7 @@ def markdown_report_images( def save_json( - obj: str, + obj: dict, file: str | pathlib.Path, indent: int = 4, sort_keys: bool = True) -> None: @@ -147,13 +147,6 @@ def generate_dataset_json(output_folder: str, assert regions_class_order is not None, "You have defined regions but regions_class_order is not set. 
" \ "You need that." - dataset_json = { - 'channel_names': channel_names, - 'labels': labels, - 'numTraining': num_training_cases, - 'file_ending': file_ending, - } - # Construct the dataset JSON structure dataset_json = { "channel_names": channel_names, From 615a232fe9f19fa1380dfa2ed1ca66578b034d61 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Fri, 29 Nov 2024 14:53:30 -0500 Subject: [PATCH 12/14] fix: update to file name and taking care of patients that have no ROI whe --ignore_missing_regex --- src/imgtools/autopipeline.py | 37 +++++++++++++++++++++++------------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/src/imgtools/autopipeline.py b/src/imgtools/autopipeline.py index ab094e6e..eba7d8b2 100644 --- a/src/imgtools/autopipeline.py +++ b/src/imgtools/autopipeline.py @@ -426,10 +426,10 @@ def process_one_subject(self, subject_id): if hasattr(read_results[i], "metadata") and read_results[i].metadata is not None: metadata.update(read_results[i].metadata) - if self.is_nnunet or self.is_nnunet_inference: # Going from {SUBJECT_NUM}_{SUBJECT_NAME} -> {SUBJECT_NAME}_{SUBJECT_NUM} - subject_id = f"{subject_id.split('_')[1]}_{subject_id.split('_')[0]:03}" + if self.is_nnunet or self.is_nnunet_inference: + nnunet_subject_name = f"{pathlib.Path(self.input_directory).name}_{subject_id.split('_')[0]:>03}" - subject_name = "_".join(subject_id.split("_")[:-1]) # Extracts {SUBJECT_NAME} + subject_name = "_".join(subject_id.split("_")[1::]) # Extracts {SUBJECT_NAME} # modality is MR and the user has selected to have nnunet output if self.is_nnunet: @@ -448,14 +448,14 @@ def process_one_subject(self, subject_id): else: self.total_modality_counter[modality] += 1 if subject_name in self.train: - self.output(subject_id, image, output_stream, nnunet_info=self.nnunet_info) + self.output(nnunet_subject_name, image, output_stream, nnunet_info=self.nnunet_info) else: - self.output(subject_id, image, output_stream, nnunet_info=self.nnunet_info, train_or_test="Ts") + self.output(nnunet_subject_name, image, output_stream, nnunet_info=self.nnunet_info, train_or_test="Ts") elif self.is_nnunet_inference: self.nnunet_info["current_modality"] = modality if modality == "CT" else metadata["AcquisitionContrast"] if self.nnunet_info["current_modality"] not in self.nnunet_info["modalities"].keys(): raise ValueError(f"The modality {self.nnunet_info['current_modality']} is not in the list of modalities that are present in dataset.json.") - self.output(subject_id, image, output_stream, nnunet_info=self.nnunet_info) + self.output(nnunet_subject_name, image, output_stream, nnunet_info=self.nnunet_info) else: self.output(subject_id, image, output_stream) @@ -536,7 +536,7 @@ def process_one_subject(self, subject_id): all_files = glob.glob(pathlib.Path(image_train_path, "*.nii.gz").as_posix()) # print(all_files) for file in all_files: - if subject_id in os.path.split(file)[1]: + if nnunet_subject_name in os.path.split(file)[1]: os.remove(file) warnings.warn(f"Patient {subject_id} is missing a complete image-label pair") self.patients_with_missing_labels.add("".join(subject_id.split("_")[1:])) @@ -558,9 +558,9 @@ def process_one_subject(self, subject_id): sparse_mask = sitk.GetImageFromArray(sparse_mask) # convert the nparray to sitk image sparse_mask.CopyInformation(image) if subject_name in self.train: - self.output(subject_id, sparse_mask, output_stream, nnunet_info=self.nnunet_info, label_or_image="labels") # rtstruct is label for nnunet + self.output(nnunet_subject_name, sparse_mask, output_stream, 
nnunet_info=self.nnunet_info, label_or_image="labels") # rtstruct is label for nnunet else: - self.output(subject_id, sparse_mask, output_stream, nnunet_info=self.nnunet_info, label_or_image="labels", train_or_test="Ts") + self.output(nnunet_subject_name, sparse_mask, output_stream, nnunet_info=self.nnunet_info, label_or_image="labels", train_or_test="Ts") else: # if there is only one ROI, sitk.GetArrayFromImage() will return a 3d array instead of a 4d array with one slice if len(mask_arr.shape) == 3: @@ -627,23 +627,34 @@ def save_data(self): shutil.rmtree(pathlib.Path(self.output_directory, ".temp").as_posix()) if self.is_nnunet: + train_dir = ((pathlib.Path(self.output_directory)) / 'imagesTr') + num_training_cases = sum( # This can be different from len(self.train) if regex of ROI not matched + 1 for file in train_dir.iterdir() + if file.suffixes == ['.nii', '.gz'] + ) + test_dir = ((pathlib.Path(self.output_directory)) / 'imagesTs') + num_test_cases = sum( # This can be different from len(self.test) if regex of ROI not matched + 1 for file in test_dir.iterdir() + if file.suffixes == ['.nii', '.gz'] + ) if test_dir.exists() else 0 # no testing data + channel_names_mapping = { # Earlier generated as {"CT": ""0000"} now needed as {"0": "CT"} self.nnunet_info["modalities"][k].lstrip('0') or '0': k for k in self.nnunet_info["modalities"].keys() } generate_dataset_json( - output_folder=pathlib.Path(self.output_directory).as_posix(), + output_folder=pathlib.Path(self.output_directory), channel_names=channel_names_mapping, labels=self.existing_roi_indices, file_ending='.nii.gz', - num_training_cases=len(self.train) + num_training_cases=num_training_cases ) create_train_script(self.output_directory, self.dataset_id) markdown_report_images( self.output_directory, self.total_modality_counter, - len(self.train), - len(self.test) + num_training_cases, + num_test_cases ) # Save summary info (factor into different file) From d3e1a11686d187a1bd91e1746b5c87c58cba6dd6 Mon Sep 17 00:00:00 2001 From: Joshua Siraj Date: Mon, 2 Dec 2024 10:15:23 -0500 Subject: [PATCH 13/14] docs: update to nnunet stuff on main page and minor edit on nnunet page --- docs/AutoPipeline.md | 35 +++++++++++++++++++++++------------ docs/nnUNet.md | 6 +++--- 2 files changed, 26 insertions(+), 15 deletions(-) diff --git a/docs/AutoPipeline.md b/docs/AutoPipeline.md index 835ee021..c4f3e365 100644 --- a/docs/AutoPipeline.md +++ b/docs/AutoPipeline.md @@ -174,14 +174,18 @@ The contours can be selected by creating a YAML file to define a regular express ```sh OUTPUT_DIRECTORY ├── nnUNet_preprocessed - ├── nnUNet_raw_data_base - │ └── nnUNet_raw_data - │ └── Task500_HNSCC - │ ├── imagesTr - │ ├── imagesTs - │ ├── labelsTr - │ └── labelsTs - └── nnUNet_trained_models + ├── nnUNet_raw + │   └── Dataset001_HNSCC + │   ├── dataset.csv + │   ├── dataset.json + │   ├── imagesTr + │   ├── imagesTs + │   ├── labelsTr + │   ├── labelsTs + │   ├── markdown_images + │   ├── nnunet_preprocess_and_train.sh + │   └── report.md + └── nnUNet_results ``` 2. **Training Size** @@ -231,7 +235,7 @@ The contours can be selected by creating a YAML file to define a regular express ```sh OUTPUT_DIRECTORY - ├── 0_subject1_0000.nii.gz + ├── {DATASET}_{SUBJECT_NUM}_{MODALITY}.nii.gz └── ... 
 ```
 
 To run inference, run the command:
 
 ```sh
 nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -d DATASET_NAME_OR_ID -c CONFIGURATION
 ```
 
 In this case, the `INPUT_FOLDER` of nnUNet is the `OUTPUT_DIRECTORY` of Med-ImageTools.
\ No newline at end of file

From caf5da502afeef3c866a7fc32d20ee72c74dae67 Mon Sep 17 00:00:00 2001
From: Joshua Siraj
Date: Mon, 16 Dec 2024 15:09:28 -0500
Subject: [PATCH 14/14] refactor: small type update + gitignore updated to
 ignore test outputs

---
 .gitignore                   | 1 +
 src/imgtools/utils/nnunet.py | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 5d6e73e1..866e942b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,6 +4,7 @@ examples/autotest.py
 temp_outputs
 .ruff_cache
 *built_with*
+nnunet_out*
 
 #vscode files
 /.idea

diff --git a/src/imgtools/utils/nnunet.py b/src/imgtools/utils/nnunet.py
index 9606df3f..389082be 100644
--- a/src/imgtools/utils/nnunet.py
+++ b/src/imgtools/utils/nnunet.py
@@ -85,7 +85,7 @@
 # Code take from: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/dataset_conversion/generate_dataset_json.py
-def generate_dataset_json(output_folder: str,
+def generate_dataset_json(output_folder: pathlib.Path | str,
                           channel_names: Dict[str, str],
                           labels: Dict[str, int],
                           num_training_cases: int,
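
As a closing illustration: the `channel_names` dictionary that these patches thread into `dataset.json` is produced in `save_data()` by inverting Med-ImageTools' modality map, which stores nnUNet v1-style channel suffixes. A minimal, self-contained sketch of that inversion follows — the dict literals are hypothetical examples, not values taken from the repository:

```python
# Med-ImageTools tracks modalities as {"CT": "0000"} (v1-style suffixes);
# nnUNet v2's dataset.json expects the inverse, {"0": "CT"}.
modalities = {"CT": "0000", "T1": "0001"}  # hypothetical example input

channel_names = {suffix.lstrip("0") or "0": name for name, suffix in modalities.items()}
# lstrip("0") reduces "0000" to the empty string, so `or "0"` restores
# channel index 0; "0001" becomes "1", and so on.

assert channel_names == {"0": "CT", "1": "T1"}
```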