build: Release v1.3.0
* feat: Add automatic batch size and safe testing (#68)

* feat: Add automatic batch size lightning callback, disabled by default

* feat: Automatically skip callbacks if there's a disable parameter set to true

* feat: Implement automatic batch computation for sklearn based tasks

* feat: Add safe test execution with batch scaling for inference tasks

* fix: Fix wrong datamodule initialization

* build: Bump version 1.2.6 -> 1.3.0

* feat: Add automatic batch size callback in anomalib callbacks config

* docs: Update changelog

* docs: Add information about automatic batch size for training

* feat: Add improvement over lightning callback to fix training stage issue and allow selecting application stages

Approved By: @rcmalli

* test: Improve test speed, fix tasks crashing when no checkpoint is provided

Reduce test time (#69)

* test: Start setting up training mocking

* build: Add pytest-mock to requirements

* build: Add --mock-training flag to test pipeline

* fix: Fix export crash when no checkpoint is provided

* fix: Fix breaking task when no checkpoint is available

* test: Fix train mock for patchcore and efficient_ad, reduce test dataset dimension

* test: Add mock training possibility for segmentation tests, reduce default test datasets

* fix: Fix wrong datamodule initialization

* test: Add mock training fixture for classification training, remove test with run_test flag set to false

* test: Reduce the number of patches for patch training

* test: Add mock training fixture to multilabel classification

* feat: Add safe test execution with batch scaling for inference tasks

* fix: Fix number of threads for torch not set properly, set onnx threads for export

* docs: Update changelog

* test: Mark csflow test as slow

* test: Mark draem test as slow

* build: Allow installation using python 3.10, deprecate 3.8

* build: Upgrade minimum requirement to python 3.9, fix packages for 3.10 installation

* refactor: Fix wrong indentation

* docs: Add python 3.10 information in readme

* test: Run automations using python 3.10

* test: Run automatic tests on both python 3.9 and 3.10

Approved By: @rcmalli

* fix: Fix missing string marks

* feat: Use multiclass datamodule for segmentation generic example (#73)

Update oxford pet segmentation example to multiclass segmentation task (#73)

* feat: update oxford segmentation example

* fix: update parameter name for the model

* feat: update analysis logs

Approved-By: @lorenzomammana

---------

Co-authored-by: Refik Can Malli <[email protected]>
lorenzomammana and rcmalli authored Oct 9, 2023
2 parents d473e25 + 5677cb3 commit 418373f
Showing 35 changed files with 526 additions and 122 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yaml
@@ -17,7 +17,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.9
python-version: "3.10"

- name: Install Dependencies
run: |
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -21,7 +21,7 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.9"]
python-version: ["3.9", "3.10"]
timeout-minutes: 60
steps:
- uses: actions/checkout@v3
@@ -41,4 +41,4 @@ jobs:
- name: Run Tests
run: |
python -m pytest -v --disable-pytest-warnings --strict-markers --color=yes -m "not slow"
python -m pytest -v --disable-pytest-warnings --strict-markers --mock-training --color=yes -m "not slow"
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,28 @@
# Changelog
All notable changes to this project will be documented in this file.

### [1.3.0]

#### Added

- Add batch_size_finder callback for lightning-based models (disabled by default).
- Add automatic_batch_size parameter to sklearn-based training tasks (disabled by default).
- Add automatic_batch_size decorator to automatically adjust the batch size of test functions for evaluation tasks if an out-of-memory error occurs.
- Add --mock-training flag for tests to skip running the actual training and just run the test.

#### Fixed

- Fix lightning based tasks not working properly when no checkpoint was provided.
- Fix list and dict config not handled properly as input_shapes parameter.

#### Updated

- Greatly reduce the dimension of test datasets to improve testing speed.
- Make `disable` a quadra reserved keyword for all callbacks; to disable a callback, set `disable: true` in its configuration file entry.

### [1.2.7]

#### Fixed
4 changes: 2 additions & 2 deletions README.md
@@ -46,7 +46,7 @@ ______________________________________________________________________

## Quick Start Guide

Currently we support installing from source since the library is not yet available on `PyPI` and currently supported Python version is `3.9`.
Currently we support installing from source since the library is not yet available on `PyPI`; the supported Python versions are `3.9` and `3.10`.

```shell
pip install git+https://github.com/orobix/quadra.git
@@ -59,7 +59,7 @@ If you don't have virtual environment ready, Let's set up our environment for us
Create and activate a new `Conda` environment.

```shell
conda create -n myenv python=3.9
conda create -n myenv python=3.10
conda activate myenv
```

12 changes: 12 additions & 0 deletions docs/tutorials/examples/anomaly_detection.md
@@ -126,8 +126,20 @@ callbacks:
disable: true
plot_only_wrong: false
plot_raw_outputs: false
batch_size_finder:
_target_: quadra.callbacks.lightning.BatchSizeFinder
mode: power
steps_per_trial: 3
init_val: 2
max_trials: 5 # Max 64
batch_arg_name: train_batch_size
disable: true
```

!!! warning

By default the lightning batch_size_finder callback is disabled. This callback will automatically try to infer the maximum batch size that can be used for training without running out of memory. We've experienced runtime errors with this callback on some machines due to a PyTorch/cuDNN incompatibility, so be careful when using it.
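To enable it for a specific experiment you can flip the reserved `disable` key in your configuration, for example (a minimal sketch keeping the other parameters at the defaults shown above):

```yaml
callbacks:
  batch_size_finder:
    disable: false
    batch_arg_name: train_batch_size # the datamodule attribute that holds the training batch size
```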

The min_max_normalization callback is used to normalize the anomaly maps to the range [0, 1] such that the threshold will become 0.5.

The threshold_type can be either "image" or "pixel" and indicates which threshold is used to normalize the pixel-level threshold. If no masks are available for segmentation this should always be "image"; otherwise the normalization will use the threshold computed without masks, which would result in wrong segmentations.
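For reference, the two options might be set together as in the sketch below; the exact callback entry and its `_target_` are omitted and may differ from the shipped anomalib callbacks config, so treat the keys as illustrative:

```yaml
min_max_normalization:
  threshold_type: image # use "pixel" only when ground-truth masks are available
```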
2 changes: 1 addition & 1 deletion docs/tutorials/examples/segmentation.md
@@ -167,7 +167,7 @@ export:
backbone:
model:
classes: 4 # The total number of classes (background + foreground)
num_classes: 4 # The total number of classes (background + foreground)
task:
run_test: true # run test after training is completed
5 changes: 5 additions & 0 deletions docs/tutorials/examples/sklearn_classification.md
@@ -147,6 +147,9 @@ datamodule:
task:
device: cuda:0
automatic_batch_size:
starting_batch_size: 1024
disable: true
output:
folder: classification_experiment
report: true
@@ -157,6 +160,8 @@ task:
This will train a logistic regression classifier using a resnet18 backbone, resizing the images to 224x224 and using a 5-fold cross validation. The `class_to_idx` parameter is used to map the class names to indexes; the indexes will be used to train the classifier. The `output` parameter is used to specify the output folder and the type of output to save. The `export.types` parameter can be used to export the model in different formats; at the moment `torchscript`, `onnx` and `pytorch` are supported.
The backbone (in torchscript and pytorch format) will be saved along with the classifier. `test_full_data` is used to specify whether a final test should be performed on all the data (after training on the training and validation datasets).

Optionally, it's possible to enable the automatic batch size finder by setting `automatic_batch_size.disable` to `false`. This will try to find the maximum batch size that can be used on the given device without running out of memory. The `starting_batch_size` parameter specifies the batch size at which the search begins; the algorithm starts from this value and halves it until inference no longer runs out of memory.
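Conceptually the search behaves like the sketch below. This is not quadra's actual implementation; `run_trial` is a hypothetical callable standing in for the feature-extraction/inference pass:

```python
import torch


def find_safe_batch_size(run_trial, starting_batch_size: int = 1024) -> int:
    """Halve the batch size until the trial function stops raising CUDA out-of-memory errors."""
    batch_size = starting_batch_size
    while batch_size >= 1:
        try:
            run_trial(batch_size)  # hypothetical trial pass on the target device
            return batch_size
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise
            torch.cuda.empty_cache()  # release partially allocated memory before retrying
            batch_size //= 2
    raise RuntimeError("No batch size fits in the available memory")
```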

### Run

Assuming that you have created a virtual environment and installed the `quadra` library, you can run the experiment by running the following command:
3 changes: 3 additions & 0 deletions docs/tutorials/examples/sklearn_patch_classification.md
@@ -223,6 +223,9 @@ datamodule:
task:
device: cuda:2
automatic_batch_size:
starting_batch_size: 1024
disable: true
output:
folder: classification_patch_experiment
report: true
14 changes: 8 additions & 6 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "quadra"
version = "1.2.7"
version = "1.3.0"
description = "Deep Learning experiment orchestration library"
authors = [
{ name = "Alessandro Polidori", email = "[email protected]" },
@@ -16,7 +16,7 @@ authors = [
keywords = ["deep learning", "experiment", "lightning", "hydra-core"]
license = { file = "LICENSE" }
readme = { file = "README.md", content-type = "text/markdown" }
requires-python = ">=3.8,<3.10"
requires-python = ">=3.9,<3.11"
classifiers = [
"Programming Language :: Python :: 3",
"Intended Audience :: Developers",
@@ -52,6 +52,7 @@ dependencies = [
"python-dotenv==0.21.*",
"rich==13.2.*",
"scikit-learn==1.2.*",
"pydantic==1.10.10",
"grad-cam==1.4.6",
"matplotlib==3.6.*",
"seaborn==0.12.*",
@@ -62,13 +63,13 @@ dependencies = [
"tripy==1.0.*",
"h5py==3.8.*",
"timm==0.6.12", # required by smp
"segmentation-models-pytorch==0.3.*",
"anomalib@git+https://github.com/orobix/[email protected]+obx.1.2.1",
"segmentation-models-pytorch==0.3.2",
"anomalib@git+https://github.com/orobix/[email protected]+obx.1.2.3",
"xxhash==3.2.*",
]

[project.optional-dependencies]
test = ["pytest==7.2.*", "pytest-cov==4.0.*", "pytest-lazy-fixture==0.6.*"]
test = ["pytest==7.2.*", "pytest-cov==4.0.*", "pytest-lazy-fixture==0.6.*", "pytest-mock==3.11.*"]

dev = [
"interrogate==1.5.*",
@@ -118,7 +119,7 @@ repository = "https://github.com/orobix/quadra"

# Adapted from https://realpython.com/pypi-publish-python-package/#version-your-package
[tool.bumpver]
current_version = "1.2.7"
current_version = "1.3.0"
version_pattern = "MAJOR.MINOR.PATCH"
commit_message = "build: Bump version {old_version} -> {new_version}"
commit = true
@@ -193,6 +194,7 @@ ignore_regex = [
".*on_train.*",
".*on_validation.*",
".*on_test.*",
".*on_predict.*",
".*forward.*",
".*backward.*",
".*training_step.*",
2 changes: 1 addition & 1 deletion quadra/__init__.py
@@ -1,4 +1,4 @@
__version__ = "1.2.7"
__version__ = "1.3.0"


def get_version():
76 changes: 76 additions & 0 deletions quadra/callbacks/lightning.py
@@ -1,6 +1,8 @@
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.callbacks.batch_size_finder import BatchSizeFinder as LightningBatchSizeFinder
from pytorch_lightning.utilities import rank_zero_only
from torch import nn

from quadra.utils.utils import get_logger

@@ -35,3 +37,77 @@ def on_fit_start(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> No
self.log_every_n_steps,
len_train_dataloader,
)


class BatchSizeFinder(LightningBatchSizeFinder):
"""Batch size finder setting the proper model training status as the current one from lightning seems bugged.
It also allows to skip some batch size finding steps.
Args:
find_train_batch_size: Whether to find the training batch size.
find_validation_batch_size: Whether to find the validation batch size.
find_test_batch_size: Whether to find the test batch size.
find_predict_batch_size: Whether to find the predict batch size.
mode: The mode to use for batch size finding. See `pytorch_lightning.callbacks.BatchSizeFinder` for more
details.
steps_per_trial: The number of steps per trial. See `pytorch_lightning.callbacks.BatchSizeFinder` for more
details.
init_val: The initial value for batch size. See `pytorch_lightning.callbacks.BatchSizeFinder` for more details.
max_trials: The maximum number of trials. See `pytorch_lightning.callbacks.BatchSizeFinder` for more details.
batch_arg_name: The name of the batch size argument. See `pytorch_lightning.callbacks.BatchSizeFinder` for more
details.
"""

def __init__(
self,
find_train_batch_size: bool = True,
find_validation_batch_size: bool = False,
find_test_batch_size: bool = False,
find_predict_batch_size: bool = False,
mode: str = "power",
steps_per_trial: int = 3,
init_val: int = 2,
max_trials: int = 25,
batch_arg_name: str = "batch_size",
) -> None:
super().__init__(mode, steps_per_trial, init_val, max_trials, batch_arg_name)
self.find_train_batch_size = find_train_batch_size
self.find_validation_batch_size = find_validation_batch_size
self.find_test_batch_size = find_test_batch_size
self.find_predict_batch_size = find_predict_batch_size

def on_train_start(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
if not self.find_train_batch_size:
return None

if not isinstance(pl_module.model, nn.Module):
raise ValueError("The model must be a nn.Module")
pl_module.model.train()
return super().on_train_epoch_start(trainer, pl_module)

def on_validation_start(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
if not self.find_validation_batch_size:
return None

if not isinstance(pl_module.model, nn.Module):
raise ValueError("The model must be a nn.Module")
pl_module.model.eval()
return super().on_validation_epoch_start(trainer, pl_module)

def on_test_start(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
if not self.find_test_batch_size:
return None

if not isinstance(pl_module.model, nn.Module):
raise ValueError("The model must be a nn.Module")
pl_module.model.eval()
return super().on_test_epoch_start(trainer, pl_module)

def on_predict_start(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
if not self.find_predict_batch_size:
return None

if not isinstance(pl_module.model, nn.Module):
raise ValueError("The model must be a nn.Module")
pl_module.model.eval()
return super().on_predict_epoch_start(trainer, pl_module)
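As a usage sketch, the callback plugs into a standard Lightning `Trainer` like any other callback; the model and datamodule are placeholders here, since quadra normally builds them from the Hydra configuration:

```python
import pytorch_lightning as pl

from quadra.callbacks.lightning import BatchSizeFinder

# Tune only the training batch size; validation/test/predict dataloaders keep their configured sizes.
batch_size_finder = BatchSizeFinder(
    find_train_batch_size=True,
    mode="power",  # double the batch size at each trial
    steps_per_trial=3,
    init_val=2,
    max_trials=5,  # with init_val=2 this caps the search at a batch size of 64
    batch_arg_name="batch_size",
)

trainer = pl.Trainer(callbacks=[batch_size_finder])
# trainer.fit(model, datamodule=datamodule)  # placeholders: supply your own LightningModule/DataModule
```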
13 changes: 13 additions & 0 deletions quadra/configs/callbacks/default.yaml
@@ -20,5 +20,18 @@ progress_bar:
lightning_trainer_setup:
_target_: quadra.callbacks.lightning.LightningTrainerBaseSetup
log_every_n_steps: 1

batch_size_finder:
_target_: quadra.callbacks.lightning.BatchSizeFinder
mode: power
steps_per_trial: 3
init_val: 2
max_trials: 5 # Max 64
batch_arg_name: batch_size
disable: true
find_train_batch_size: true
find_validation_batch_size: false
find_test_batch_size: false
find_predict_batch_size: false
#gpu_stats: TODO: This is not working with the current PL version
# _target_: nvitop.callbacks.lightning.GpuStatsLogger
8 changes: 8 additions & 0 deletions quadra/configs/callbacks/default_anomalib.yaml
@@ -55,5 +55,13 @@ progress_bar:
lightning_trainer_setup:
_target_: quadra.callbacks.lightning.LightningTrainerBaseSetup
log_every_n_steps: 1
batch_size_finder:
_target_: quadra.callbacks.lightning.BatchSizeFinder
mode: power
steps_per_trial: 3
init_val: 2
max_trials: 5 # Max 64
batch_arg_name: train_batch_size
disable: true
#gpu_stats: TODO: This is not working with the current PL version
# _target_: nvitop.callbacks.lightning.GpuStatsLogger
@@ -1,4 +1,6 @@
_target_: quadra.datamodules.generic.oxford_pet.OxfordPetSegmentationDataModule
idx_to_class:
1: cat_or_dog
data_path: ${oc.env:HOME}/.quadra/datasets/oxford-pet
test_size: 0.2
val_size: 0.2
@@ -2,12 +2,18 @@
defaults:
- base/segmentation/smp # use smp file as default
- override /datamodule: generic/oxford_pet/segmentation/base # update datamodule
- override /loss: smp_dice_multiclass
- override /model: smp_multiclass
- _self_ # use this file as final config

trainer:
devices: [0]
max_epochs: 10

backbone:
model:
num_classes: 2 # The total number of classes (background + foreground)

task:
report: true
evaluate:
7 changes: 5 additions & 2 deletions quadra/configs/task/sklearn_classification.yaml
@@ -1,7 +1,10 @@
_target_: quadra.tasks.SklearnClassification
device: "cuda:0"
device: cuda:0
automatic_batch_size:
starting_batch_size: 1024
disable: true
output:
folder: "classification_experiment"
folder: classification_experiment
report: true
example: true
test_full_data: true
5 changes: 4 additions & 1 deletion quadra/configs/task/sklearn_classification_patch.yaml
@@ -1,5 +1,8 @@
_target_: quadra.tasks.PatchSklearnClassification
device: cuda:2
device: cuda:0
automatic_batch_size:
starting_batch_size: 1024
disable: true
output:
folder: classification_patch_experiment
report: true
