Feature/export onnx models (#40)
* build: Add optional onnx dependencies

* feat: Add onnx export function

* refactor: Start logger refactoring

* feat: Add onnx export capabilities for classification

* feat: Add onnx export capability

* feat: Add evaluation model wrappers

* feat!: Refactor export_config parameter to become its own config to avoid code replication

* refactor: Refactor export function to avoid excessive code replication

* feat: Add inference configuration

* build: Upgrade anomalib version

* tests: Refactor anomaly tests to perform training and inference with onnx and torchscript

* tests: Improve tests for classification related tasks

* tests: Update segmentation tests with onnx export

* tests: Update ssl tests to integrate onnx export

* tests: Add tests to validate the outputs of exported models

* build: Add pytest lazy fixtures package to test requirements

* tests: Add tests checking the equality of exported models outputs

* tests: Add guards to skip tests when onnx is not installed, add onnx installation to GitHub tests

* style: Fix wrong parentheses

* fix: Fix wrong usage of pytest skipif

* fix: Fix missing parameter pop in export onnx function

* tests: Remove onnx export from fastflow test

* fix: Allow model wrapper to retrieve input shapes if instance is a torchscript model

* build: Bump version 1.1.3 -> 1.1.4

* docs: Update changelog

* refactor: Tiny improvements to model export

* refactor: Add dictionary mapping export types and paths to model export function return values

* fix: Fix defaults order

* refactor: Move get_export_extension function

* feat: Use iobinding to handle torch inputs for onnx

* feat: Add cpu method to evaluation models

* fix: Fix wrong configuration parameter

* fix: Fix segmentation analysis not working due to missing parameter

* feat: add gpu unit tests

* docs: Add documentation for model import and export

* docs: Add export information in documentation

* docs: Update changelog

* refactor: Remove references to save_backbone parameter

* docs: Update changelog

* docs: Fix wrong typing

* fix: Avoid exporting ModelSignatureWrapper, fix wrong onnx export with multiple inputs

* fix: Fix multiple inputs not handled properly in onnx evaluation forward

* feat: Add automatic export with strict=False if normal torchscript fails

* fix: Fix dynamic axes not generated properly when fixed_batch_size isn't passed to configuration

---------

Approved By: @AlessandroPolidori 

Co-authored-by: rcmalli <[email protected]>
lorenzomammana and rcmalli authored Sep 8, 2023
1 parent cea6f86 commit cc81f05
Showing 61 changed files with 1,798 additions and 704 deletions.
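Among the changes above, "Use iobinding to handle torch inputs for onnx" refers to ONNX Runtime's io-binding mechanism, which lets a session read GPU-resident torch tensors in place instead of copying them through numpy on the host. A minimal sketch of that pattern is shown below; the model path, input shape and single-input/single-output assumption are illustrative only and do not reproduce the project's actual implementation.

```python
import numpy as np
import onnxruntime as ort
import torch

# Load an exported model (the path is an assumption used for illustration).
session = ort.InferenceSession(
    "deployment_model/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# A torch tensor already living on the GPU; io-binding lets onnxruntime
# consume it directly instead of round-tripping through host memory.
x = torch.randn(1, 3, 224, 224, device="cuda").contiguous()

binding = session.io_binding()
binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=tuple(x.shape),
    buffer_ptr=x.data_ptr(),
)
# Let onnxruntime allocate the output, then copy it back to the host.
binding.bind_output(session.get_outputs()[0].name)
session.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()
```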
3 changes: 1 addition & 2 deletions .github/workflows/tests.yml
@@ -37,9 +37,8 @@ jobs:
- name: Install Package
run: |
python -m pip install -U pip
python -m pip install -e ".[test]" --no-cache-dir
python -m pip install -e ".[test,onnx]" --no-cache-dir
- name: Run Tests
run: |
python -m pytest -v --disable-pytest-warnings --strict-markers --color=yes -m "not slow"
1 change: 1 addition & 0 deletions .gitignore
@@ -68,3 +68,4 @@ docs/javascripts/images/*
test-output.xml
external/
site/
local/
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,21 @@ All notable changes to this project will be documented in this file.
#### Added

- Add plot_raw_outputs feature to class VisualizerCallback in anomaly detection, to save the raw images of the segmentation and heatmap output.
- Add support for onnx exportation of trained models.
- Add support for onnx model import in all evaluation tasks.
- Add `export` configuration group to regulate exportation parameters.
- Add `inference` configuration group to regulate inference parameters.

#### Changed

- Move `export_types` parameter from `task` configuration group to `export` configuration group under `types` parameter.
- Refactor the export model function to be more generic and available from the base task class.
- Remove `save_backbone` parameter for scikit-learn based tasks.

#### Fixed

- Fix failures when trying to override `hydra` configuration groups due to wrong override order.

### [1.1.4]

#### Fixed
20 changes: 14 additions & 6 deletions Makefile
@@ -1,13 +1,16 @@
# Makefile
SHELL := /bin/bash
DEVICE ?= cpu

.PHONY: help
help:
@echo "Commands:"
@echo "clean : cleans all unnecessary files."
@echo "docs-serve : serves the documentation."
@echo "docs-build : builds the documentation."
@echo "style : runs pre-commit."
@echo "clean : cleans all unnecessary files."
@echo "docs-serve : serves the documentation."
@echo "docs-build : builds the documentation."
@echo "style : runs pre-commit."
@echo "units-tests: : runs unit tests."
@echo "integration-tests: : runs integration tests."

# Cleaning
.PHONY: clean
@@ -27,8 +30,13 @@ style:
pre-commit run --all --verbose
.PHONY: docs-build
docs-build:
mkdocs build -d ./site
mkdocs build -d ./site

.PHONY: docs-serve
docs-serve:
mkdocs serve
mkdocs serve

.PHONY: units-tests
units-tests:
@python -m pytest -v --disable-pytest-warnings --strict-markers --color=yes --device $(DEVICE)

7 changes: 3 additions & 4 deletions docs/tutorials/configurations.md
@@ -160,6 +160,9 @@ defaults:
- override /scheduler: rop
- override /transforms: default_resize
export:
types: [torchscript]
datamodule:
num_workers: 8
batch_size: 32
@@ -178,10 +181,6 @@ task:
report: True
output:
example: True
export_config:
types: [torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
core:
tag: "run"
4 changes: 0 additions & 4 deletions docs/tutorials/devices_setup.md
@@ -38,13 +38,9 @@ _target_: quadra.tasks.SklearnClassification
device: "cuda:0"
output:
folder: "classification_experiment"
save_backbone: false
report: true
example: true
test_full_data: true
export_config:
types: [torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
```

You can change the device to `cpu` or a different cuda device depending on your needs.
9 changes: 6 additions & 3 deletions docs/tutorials/examples/anomaly_detection.md
@@ -184,14 +184,17 @@ As already mentioned anomaly detection requires just good images for training, t

### Experiment

Suppose that we want to run the experiment on the given dataset using the PADIM technique. We can define take the generic padim config for mnist as an example found under `experiment/generic/mnist/anomaly/padim.yaml`.
Suppose that we want to run the experiment on the given dataset using the PADIM technique. We can take the generic padim config for mnist as an example found under `experiment/generic/mnist/anomaly/padim.yaml`.

```yaml
# @package _global_
defaults:
- base/anomaly/padim
- override /datamodule: generic/mnist/anomaly/base
export:
types: [torchscript]
model:
model:
input_size: [224, 224]
@@ -226,7 +229,7 @@ trainer:
check_val_every_n_epoch: ${trainer.max_epochs}
```

We start from the base configuration for PADIM, then we override the datamodule to use the generic mnist datamodule. Using this configuration we specify that we want to use PADIM, extracting features using the resnet18 backbone with image size 224x224, the dataset is `mnist`, we specify that the task is taken from the anomalib configuration which specify it to be segmentation. One very important thing to watch out is the `check_val_every_n_epoch` parameter. This parameter should match the number of epochs for `PADIM` and `Patchcore`, the reason is that in the validation phase the model will be fitted and we want the fit to be done only once and on all the data, increasing the max_epoch is useful when we apply data augmentation, otherwise it doesn't make a lot of sense as we would fit the model on the same, replicated data.
We start from the base configuration for PADIM, then override the datamodule to use the generic mnist datamodule. With this configuration we specify that we want to use PADIM, extracting features with a resnet18 backbone at an image size of 224x224 on the `mnist` dataset; the task is taken from the anomalib configuration, which defines it as segmentation. One very important parameter to watch out for is `check_val_every_n_epoch`: it should match the number of epochs for `PADIM` and `Patchcore`, because the model is fitted during the validation phase and we want that fit to happen only once and on all the data. Increasing `max_epochs` is useful when we apply data augmentation; otherwise it doesn't make much sense, as we would fit the model on the same, replicated data. The model is exported at the end of the training phase; since we set the `export.types` parameter to `torchscript`, the model is exported only in torchscript format.

### Run

@@ -289,7 +292,7 @@ task:

By default, the inference will recompute the threshold based on test data to maximize the F1-score, if you want to use the threshold from the training phase you can set the `use_training_threshold` parameter to true.

The model path is the path to an exported model, at the moment only `torchscript` models are supported (exported automatically after a training experiment). Right now only the `CFLOW` model is not supported for inference as it's not compatible with torchscript.
The model path is the path to an exported model; at the moment `torchscript` and `onnx` models are supported (exported automatically after a training experiment). Currently only the `CFLOW` model is not supported for inference, as it's not compatible with either torchscript or onnx.

An inference configuration using the mnist dataset is found under `configs/experiment/generic/mnist/anomaly/inference.yaml`.
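As a rough reference, an exported torchscript model can also be loaded directly with plain PyTorch outside of the config-driven evaluation flow. The file name, input size and output structure below are assumptions and depend on the specific model and export settings.

```python
import torch

# File name and input size are assumptions; they depend on the export configuration.
model = torch.jit.load("deployment_model/model.pt", map_location="cpu")
model.eval()

image = torch.rand(1, 3, 224, 224)  # a preprocessed input batch
with torch.inference_mode():
    outputs = model(image)  # typically an anomaly map and/or score, depending on the model
```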

11 changes: 4 additions & 7 deletions docs/tutorials/examples/classification.md
@@ -114,9 +114,6 @@ task:
report: True
output:
example: True
export_config:
types: [pytorch, torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
core:
@@ -172,6 +169,9 @@ defaults:
- override /backbone: vit16_tiny
- _self_
export:
types: [onnx, torchscript]
datamodule:
num_workers: 12
batch_size: 32
@@ -187,9 +187,6 @@ task:
report: True
output:
example: True # Generate an example of concordants and discordants predictions for each class
export_config:
types: [pytorch, torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
model:
@@ -236,7 +233,7 @@ checkpoints config_tree.txt deployment_model test
config_resolved.yaml data main.log
```

Where `checkpoints` contains the pytorch lightning checkpoints of the model, `data` contains the joblib dump of the datamodule with its parameters and dataset split, `deployment_model` contains the model in exported format (default is torchscript), `test` contains the test artifacts.
Where `checkpoints` contains the pytorch lightning checkpoints of the model, `data` contains the joblib dump of the datamodule with its parameters and dataset split, `deployment_model` contains the model in the exported formats (in this case onnx and torchscript, while by default only torchscript is produced), and `test` contains the test artifacts.
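As a rough illustration, the exported onnx model can be loaded with ONNX Runtime for a quick sanity check; the file name, input shape and single-output assumption below are hypothetical and should be verified against the actual contents of `deployment_model`.

```python
import numpy as np
import onnxruntime as ort

# File name and input shape are assumptions; check the run's deployment_model folder.
session = ort.InferenceSession("deployment_model/model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # a preprocessed image batch
(logits,) = session.run(None, {input_name: batch})
predicted_class = int(np.argmax(logits, axis=1)[0])
```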

## Evaluation

3 changes: 0 additions & 3 deletions docs/tutorials/examples/multilabel_classification.md
@@ -139,9 +139,6 @@ task:
report: False
output:
example: False
export_config:
types: [torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
logger:
9 changes: 4 additions & 5 deletions docs/tutorials/examples/segmentation.md
@@ -162,6 +162,9 @@ defaults:
- base/segmentation/smp_multiclass # use smp file as default
- _self_ # use this file as final config
export:
types: [onnx, torchscript]
backbone:
model:
classes: 4 # The total number of classes (background + foreground)
@@ -171,10 +174,6 @@ task:
report: false # allows to generate reports
evaluate: # custom evaluation toggles
analysis: false # Perform in depth analysis
export_config:
types: [torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
datamodule:
data_path: /path/to/the/dataset # change the path to the dataset
@@ -200,7 +199,7 @@ core:
When defining the `idx_to_class` dictionary, the keys should be the class index and the values should be the class name. The class index starts from 1.


In the final configuration experiment we have specified the path to the dataset, batch size, split files, GPU device, experiment name and toggled some evaluation options.
In the final configuration experiment we have specified the path to the dataset, batch size, split files, GPU device and experiment name, and toggled some evaluation options; we have also specified that the model should be exported to the `onnx` and `torchscript` formats.

By default data will be logged to `Mlflow`. If `Mlflow` is not available it's possible to configure a simple csv logger by adding an override to the file above:

24 changes: 10 additions & 14 deletions docs/tutorials/examples/sklearn_classification.md
@@ -95,17 +95,14 @@ defaults:
- override /trainer: sklearn_classification
- override /datamodule: base/sklearn_classification
export:
types: [pytorch, torchscript]
backbone:
model:
pretrained: true
freeze: true
task:
export_config:
types: [pytorch, torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
core:
tag: "run"
name: "sklearn-classification"
@@ -121,6 +118,7 @@ By default the experiment will use dino_vitb8 as backbone, resizing the images t
It will also export the model in two formats, "torchscript" and "pytorch".

An actual configuration file based on the above could be this one (suppose it's saved under `configs/experiment/custom_experiment/sklearn_classification.yaml`):

```yaml
# @package _global_
@@ -132,6 +130,9 @@ defaults:
core:
name: experiment-name
export:
types: [pytorch, torchscript]
datamodule:
data_path: path_to_dataset
batch_size: 64
@@ -148,18 +149,13 @@ task:
device: cuda:0
output:
folder: classification_experiment
save_backbone: true
report: true
example: true
test_full_data: true
export_config:
types: [pytorch, torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
```

This will train a logistic regression classifier using a resnet18 backbone, resizing the images to 224x224 and using a 5-fold cross validation. The `class_to_idx` parameter is used to map the class names to indexes, the indexes will be used to train the classifier. The `output` parameter is used to specify the output folder and the type of output to save. The `export_config.types` parameter can be used to export the model in different formats, at the moment `torchscript` and `pytorch` are supported.
Since `save_backbone` is set to true, the backbone (in torchscript format) will be saved along with the classifier. `test_full_data` is used to specify if a final test should be performed on all the data (after training on the training and validation datasets).
This will train a logistic regression classifier using a resnet18 backbone, resizing the images to 224x224 and using a 5-fold cross validation. The `class_to_idx` parameter maps the class names to indexes; the indexes are used to train the classifier. The `output` parameter specifies the output folder and the type of output to save. The `export.types` parameter can be used to export the model in different formats; at the moment `torchscript`, `onnx` and `pytorch` are supported.
The backbone (in torchscript and pytorch format) will be saved along with the classifier. `test_full_data` specifies whether a final test should be performed on all the data (after training on the training and validation datasets).

### Run

@@ -181,7 +177,7 @@ classification_experiment_2 config_tree.txt test

Each `classification_experiment_X` folder contains the metrics for the corresponding fold while the `classification_experiment` folder contains the metrics computed aggregating the results of all the folds.

The `data` folder contains a joblib version of the datamodule containing parameters and splits for reproducibility. The `deployment_model` folder contains the backbone exported in torchscript format if `save_backbone` to true alongside the joblib version of trained classifier. The `test` folder contains the metrics for the final test on all the data after the model has been trained on both train and validation.
The `data` folder contains a joblib version of the datamodule containing parameters and splits for reproducibility. The `deployment_model` folder contains the backbone exported in torchscript and pytorch format alongside the joblib version of the trained classifier. The `test` folder contains the metrics for the final test on all the data after the model has been trained on both the training and validation data.
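A rough sketch of how the exported artifacts could be combined for inference is shown below; the file names and the feature shape are assumptions and may differ from what the task actually writes.

```python
import joblib
import torch

# File names are assumptions; check the deployment_model folder of the actual run.
backbone = torch.jit.load("deployment_model/model.pt", map_location="cpu")
backbone.eval()
classifier = joblib.load("deployment_model/classifier.joblib")

image = torch.rand(1, 3, 224, 224)  # a preprocessed image batch
with torch.inference_mode():
    features = backbone(image)  # expected shape (batch, feature_dim)

prediction = classifier.predict(features.numpy())
```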

## Evaluation
The same datamodule specified before can be used for inference by setting the `phase` parameter to `test`.
10 changes: 4 additions & 6 deletions docs/tutorials/examples/sklearn_patch_classification.md
@@ -205,6 +205,9 @@ defaults:
- override /backbone: resnet18
- _self_
export:
types: [torchscript]
core:
name: experiment-name
@@ -222,14 +225,9 @@ task:
device: cuda:2
output:
folder: classification_patch_experiment
save_backbone: false
report: true
example: true
reconstruction_method: major_voting
export_config:
types: [torchscript]
input_shapes: # Redefine the input shape if not automatically inferred
```

This will train a resnet18 model on the given dataset, using 256 as batch size and skipping the background class during training.
@@ -253,7 +251,7 @@ config_tree.txt main.log

Inside the `classification_patch_experiment` folder you should find some report utilities computed over the validation dataset, like the confusion matrix. The `reconstruction_results.json` file contains the reconstruction metrics computed over the validation dataset in terms of covered defects, it will also contain the coordinates of the polygons extracted over predicted areas of the image with the same label.

The `data` folder contains a joblib version of the datamodule containing parameters and splits for reproducibility. The `deployment_model` folder contains the backbone exported in torchscript format if `save_backbone` to true alongside the joblib version of trained classifier.
The `data` folder contains a joblib version of the datamodule containing parameters and splits for reproducibility. The `deployment_model` folder contains the backbone exported in torchscript format alongside the joblib version of the trained classifier.

## Evaluation
The same datamodule specified before can be used for inference.
2 changes: 1 addition & 1 deletion docs/tutorials/examples/ssl.md
@@ -160,7 +160,7 @@ The output folder should contain the following entries:
checkpoints config_resolved.yaml config_tree.txt data deployment_model main.log
```

The `checkpoints` folder contains the saved `pytorch` lightning checkpoints. The `data` folder contains a joblib version of the datamodule containing all parameters and dataset spits. The `deployment_model` folder contains the model ready for production in the format specified in the task `export_config.types` parameter (default `torchscript`).
The `checkpoints` folder contains the saved `pytorch` lightning checkpoints. The `data` folder contains a joblib version of the datamodule containing all parameters and dataset splits. The `deployment_model` folder contains the model ready for production in the format specified in the `export.types` parameter (default `torchscript`).

### Run (Advanced) - Changing transformations
