[FTT-NAS doc] amend ftt-nas doc, configs. Small fix of the rram patch.
walkerning committed Sep 1, 2020
1 parent d9d0ba2 commit 8126860
Showing 8 changed files with 192 additions and 64 deletions.
43 changes: 38 additions & 5 deletions examples/research/ftt-nas/README.md
@@ -10,12 +10,45 @@ If you find this work/repo helpful, please cite:
}
```

All experiments are conducted with 8-bit quantization. We use a patch-based quantization library, [nics_fix_pytorch](https://github.com/walkerning/nics_fix_pytorch). You can install the compatible version of `nics_fix_pytorch` with:
```
pip install git+git://github.com/walkerning/nics_fix_pytorch.git@9b97b9402521577cf40910ba4f18c790abe5319f
```

Note that the quantization is simulated, which makes the search and final-training processes much slower.
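
For intuition, below is a minimal sketch of what "simulated" 8-bit quantization means here: the tensors stay in floating point and are merely rounded onto an 8-bit fixed-point grid, which adds extra work to every forward pass. The helper name and the power-of-two scale convention are our assumptions, not the actual `nics_fix_pytorch` API.
```
import torch

def fake_quantize_8bit(x):
    # Pick a power-of-two scale that covers the dynamic range (one common
    # fixed-point convention; the real library may choose the scale differently).
    max_abs = x.abs().max().clamp(min=1e-8)
    scale = 2.0 ** torch.ceil(torch.log2(max_abs))
    step = scale / 128.0  # 8 bits: sign bit + 7 magnitude bits
    # Round onto the grid, then clamp to the representable range.
    return torch.clamp(torch.round(x / step), -128, 127) * step

w = torch.randn(16, 3, 3, 3)
print((fake_quantize_8bit(w) - w).abs().max())  # error is roughly bounded by step / 2
```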

### feature fault model (MiBB)
The quantization patch for the MiBB model is `examples/research/ftt-nas/fixed_point_plugins/fixed_point_patch_new.py`, and the following scripts copy this patch into the plugin directory of `aw_nas`. The patch quantizes all weights before each call of `forward` or `forward_one_step_callback`, and every feature map is quantized in `aw_nas.objective.fault_injection:FaultInjectionObjective.inject`, which is called by `forward_one_step_callback`.

MiBB fault injection is conducted in `aw_nas.objective.fault_injection.FaultInjectionObjective.inject`.
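
To make the mechanism above concrete, the sketch below mimics the two steps under stated assumptions: a `forward_pre_hook` that fake-quantizes the weights before every forward pass, and a helper that injects random bit-bias faults into the feature maps. The function names, the fixed step size, and the fault distribution are illustrative assumptions; the actual plugin and `FaultInjectionObjective.inject` in `aw_nas` differ in their details.
```
import torch
from torch import nn

STEP = 2.0 ** -7  # assumed 8-bit fixed-point step; illustrative only

def quantize_weights_pre_forward(module, inputs):
    # Fake-quantize all weights of the module right before its forward pass,
    # mimicking what the fixed_point_patch_new.py plugin does per forward call.
    with torch.no_grad():
        for p in module.parameters():
            p.copy_(torch.clamp(torch.round(p / STEP), -128, 127) * STEP)

def inject_feature_faults(feat, p_bit=1e-4):
    # Rough emulation of MiBB-style feature faults: with a small probability,
    # add the value of one of the 8 bit positions onto a feature-map element.
    mask = (torch.rand_like(feat) < p_bit).float()
    bit_pos = torch.randint(0, 8, feat.shape, device=feat.device).float()
    return feat + mask * STEP * (2.0 ** bit_pos)

conv = nn.Conv2d(3, 8, 3, padding=1)
conv.register_forward_pre_hook(quantize_weights_pre_forward)
out = conv(torch.randn(1, 3, 32, 32))
faulty_out = inject_feature_faults(out)  # in aw_nas this happens inside the objective
```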

*Feel free to ignore this note.* There is another patch, `fixed_point_patch.py`, that patches the `nn.Conv2d` and `nn.Linear` modules directly. During our experiments, we found the patching method in `fixed_point_patch_new.py` to be faster (see the comments in the patch), so we use the `fixed_point_patch_new.py` patch.
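
For reference, "patching `nn.Conv2d` directly" means something like the following schematic (not the actual `fixed_point_patch.py`): the layer quantizes its own weight inside `forward`, so the quantization cost is paid per layer on every call.
```
import torch
from torch import nn
import torch.nn.functional as F

class FixedConv2d(nn.Conv2d):
    # Schematic of the direct-patching route: the layer quantizes its own weight
    # inside forward(); step size and clamping are illustrative assumptions.
    def forward(self, x):
        step = 2.0 ** -7
        w_q = torch.clamp(torch.round(self.weight / step), -128, 127) * step
        return F.conv2d(x, w_q, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

# "Patching" then amounts to making models use this class instead of nn.Conv2d,
# e.g. by rebinding the name before the model is constructed.
y = FixedConv2d(3, 8, 3, padding=1)(torch.randn(1, 3, 32, 32))
```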

#### Search
```
FIXED=1 bash ./examples/research/ftt-nas/run_mibb_search.sh ./examples/research/ftt-nas/mibb.yaml
```
Use `GPU=<id> ...` to run on a different GPU.

#### Final training
`bash ./examples/research/ftt-nas/run_mibb.sh [exp name] [final config] --load_state_dict {state_dict}`
Optional environment variables: `GPU=<id>`, `seed=<seed>`, `FIXED=<0/1>`.

### weight fault model (adSAF)

Unlike under the MiBB model, the quantization and fault injection under the adSAF fault model are both conducted in the `fixed_point_plugins/fixed_point_rram_patch*.py` patches. The two patches differ slightly:
* The `fixed_point_plugins/fixed_point_rram_patch_all.py` patch adds differently-shifted biases onto the weights; this is only an approximation of stuck-at bits in RRAM cells.
* The `fixed_point_plugins/fixed_point_rram_patch_bit.py` patch employs bitwise operations, which corresponds more closely to the hardware faults (see the sketch below).

The experiments in the paper are conducted with the `_all.py` patch.
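
To make the difference between the two patches concrete, here is a rough sketch of the two injection styles on an integer weight tensor. The helper names, the per-bit stuck-at-1 probability, and the int16 dtype are illustrative assumptions; the real patches differ (e.g., they also need to handle stuck-at-0 faults and the quantization scales).
```
import torch

def inject_saf_additive(w_int, p_sa1=0.08, n_bits=8):
    # `_all.py`-style approximation: for every bit position assumed stuck-at-1,
    # simply add the corresponding shifted bias onto the integer weight,
    # ignoring whether that bit was already 1.
    faulty = w_int.clone()
    for bit in range(n_bits):
        stuck = (torch.rand(w_int.shape) < p_sa1 / n_bits).to(w_int.dtype)
        faulty = faulty + stuck * (1 << bit)
    return faulty

def inject_saf_bitwise(w_int, p_sa1=0.08, n_bits=8):
    # `_bit.py`-style version: OR the stuck-at-1 mask into the weight bits, which
    # matches the hardware behaviour (a bit that is already 1 stays unchanged).
    faulty = w_int.clone()
    for bit in range(n_bits):
        stuck = (torch.rand(w_int.shape) < p_sa1 / n_bits).to(w_int.dtype)
        faulty = faulty | (stuck << bit)
    return faulty

# int16 is used only to avoid overflow in the additive sketch; only stuck-at-1
# faults are shown here.
w = torch.randint(-128, 128, (16, 16), dtype=torch.int16)
print(inject_saf_additive(w).ne(w).float().mean().item(),
      inject_saf_bitwise(w).ne(w).float().mean().item())
```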

#### Search
```
bash ./examples/research/ftt-nas/run_adsaf_search.sh ./examples/research/ftt-nas/adsaf.yaml
```

#### Final training
`bash ./examples/research/ftt-nas/run_adsaf.sh adsaf_final ./examples/research/ftt-nas/adsaf_final.yaml`
You can set optional environment variables such as `GPU=<id>` and `seed=<seed>`, and pass additional arguments such as `--load_state_dict {state_dict}`.

Because the FTT-NAS experiments were conducted with commit `27d1aeb4121c320ed11361b705`, I have adapted the `adsaf_final.yaml` configuration to the current master `d9d0ba26870b009778f2209f22fde876c0e55aa2`. I'm not sure whether there are other subtle changes that would make the results differ; if you find that you cannot reproduce the results with the latest code, please contact us by email or open an issue.
92 changes: 92 additions & 0 deletions examples/research/ftt-nas/adsaf_final.yaml
@@ -0,0 +1,92 @@
## ---- Component search_space ----
# ---- Type cnn ----
search_space_type: cnn
search_space_cfg:
  # Schedulable attributes:
  # cell_layout: [0, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3]
  cell_layout: null
  num_cell_groups: 2
  num_init_nodes: 2
  num_layers: 8
  num_node_inputs: 2
  num_steps: 4
  reduce_cell_groups:
  - 1
  shared_primitives: # this is not used
  - none
  - max_pool_3x3
  - avg_pool_3x3
  - skip_connect
  - sep_conv_3x3
  - sep_conv_5x5
  - dil_conv_3x3
  - dil_conv_5x5
# ---- End Type cnn ----
## ---- End Component search_space ----

## ---- Component dataset ----
# ---- Type cifar10 ----
dataset_type: cifar10
dataset_cfg:
  # Schedulable attributes:
  cutout: null
# ---- End Type cifar10 ----
## ---- End Component dataset ----

## ---- Component final_model ----
# ---- Type cnn_genotype ----
final_model_type: cnn_patch_final_model
final_model_cfg:
  # Schedulable attributes: dropout_path_rate
  genotypes: "normal_0=[('relu_conv_bn_3x3', 0, 2), ('relu_conv_bn_5x5', 0, 2), ('conv_1x1', 1, 3), ('dil_conv_3x3', 1, 3), ('sep_conv_3x3', 1, 4), ('sep_conv_3x3', 0, 4), ('max_pool_3x3', 2, 5), ('relu_conv_bn_5x5', 1, 5)], reduce_1=[('skip_connect', 0, 2), ('relu_conv_bn_5x5', 1, 2), ('dil_conv_3x3', 2, 3), ('sep_conv_3x3', 0, 3), ('max_pool_3x3', 1, 4), ('conv_1x1', 3, 4), ('dil_conv_3x3', 4, 5), ('skip_connect', 2, 5)]"
  auxiliary_cfg: null
  auxiliary_head: false
  dropout_path_rate: 0.0
  dropout_rate: 0.0
  init_channels: 20
  num_classes: 10
  cell_use_preprocess: true
  cell_preprocess_stride: relu_conv_bn_3x3
  cell_preprocess_normal: relu_conv_bn_3x3
  # modified due to interface change of cnn_final_model
  # preprocess_op_type: relu_conv_bn_3x3
  schedule_cfg: null
  stem_multiplier: 3
# ---- End Type cnn_genotype ----
## ---- End Component final_model ----

objective_type: saf_injection
objective_cfg:
  as_evaluator_regularization: true
  as_controller_regularization: true
  inject_prob: 0.08
  fault_loss_coeff: 0.7
  fault_reward_coeff: 0.2
  latency_reward_coeff: 0
  activation_fixed_bitwidth: 8

## ---- Component final_trainer ----
# ---- Type cnn_trainer ----
final_trainer_type: cnn_trainer
final_trainer_cfg:
  # Schedulable attributes:
  auxiliary_head: false
  auxiliary_weight: 0.0
  add_regularization: true
  batch_size: 128
  epochs: 100
  grad_clip: 5.0
  learning_rate: 0.1
  momentum: 0.9
  no_bias_decay: false
  optimizer_type: SGD
  optimizer_scheduler:
    type: MultiStepLR
    milestones: [40, 80]
    gamma: 0.1
  schedule_cfg: null
  warmup_epochs: 0
  weight_decay: 0.00004
  save_as_state_dict: true
# ---- End Type cnn_trainer ----
## ---- End Component final_trainer ----
@@ -1,7 +1,6 @@
"""
Script for patching fixed point modules.
"""
import six
import numpy as np
import torch
from torch import nn
@@ -140,39 +139,12 @@ def __init__(self, *args, **kwargs):
class CNNGenotypeModelPatch(CNNGenotypeModel):
    NAME = "cnn_patch_final_model"

    SCHEDULABLE_ATTRS = ["dropout_path_rate"]

    def __init__(self, search_space, device, genotypes,
                 num_classes=10, init_channels=36, layer_channels=tuple(), stem_multiplier=3,
                 dropout_rate=0.1, dropout_path_rate=0.2,
                 auxiliary_head=False, auxiliary_cfg=None,
                 use_stem="conv_bn_3x3", stem_stride=1, stem_affine=True,
                 cell_use_preprocess=True, preprocess_op_type=None,
                 cell_pool_batchnorm=False, cell_group_kwargs=None,
                 cell_independent_conn=False,
                 schedule_cfg=None):
        super(CNNGenotypeModelPatch, self).__init__(search_space, device, genotypes,
            num_classes, init_channels, layer_channels, stem_multiplier, dropout_rate,
            dropout_path_rate, auxiliary_head, auxiliary_cfg,
            use_stem, stem_stride, stem_affine,
            preprocess_op_type, cell_use_preprocess, cell_pool_batchnorm, cell_group_kwargs,
            cell_independent_conn, schedule_cfg)

    def set_saf_ratio(self, ratio):
        for idx, _module in self.named_modules():
            if isinstance(_module, nn.Conv2d):
                _module.set_saf_ratio(ratio)

class SubCandidateNetPatch(SubCandidateNet):
    """
    The candidate net for SuperNet weights manager.
    """

    def __init__(self, super_net, rollout, member_mask, gpus=tuple(), cache_named_members=False,
                 virtual_parameter_only=True, eval_no_grad=True):
        super(SubCandidateNetPatch, self).__init__(super_net, rollout, member_mask, gpus, cache_named_members,
            virtual_parameter_only, eval_no_grad)

    def set_saf_ratio(self, ratio):
        for idx, _module in self.named_modules():
            if isinstance(_module, nn.Conv2d):
@@ -185,39 +185,12 @@ def __init__(self, *args, **kwargs):
class CNNGenotypeModelPatch(CNNGenotypeModel):
    NAME = "cnn_patch_final_model"

    SCHEDULABLE_ATTRS = ["dropout_path_rate"]

    def __init__(self, search_space, device, genotypes,
                 num_classes=10, init_channels=36, layer_channels=tuple(), stem_multiplier=3,
                 dropout_rate=0.1, dropout_path_rate=0.2,
                 auxiliary_head=False, auxiliary_cfg=None,
                 use_stem="conv_bn_3x3", stem_stride=1, stem_affine=True,
                 cell_use_preprocess=True, preprocess_op_type=None,
                 cell_pool_batchnorm=False, cell_group_kwargs=None,
                 cell_independent_conn=False,
                 schedule_cfg=None):
        super(CNNGenotypeModelPatch, self).__init__(search_space, device, genotypes,
            num_classes, init_channels, layer_channels, stem_multiplier, dropout_rate,
            dropout_path_rate, auxiliary_head, auxiliary_cfg,
            use_stem, stem_stride, stem_affine,
            cell_use_preprocess, preprocess_op_type, cell_pool_batchnorm, cell_group_kwargs,
            cell_independent_conn, schedule_cfg)

    def set_saf_ratio(self, ratio):
        for idx, _module in self.named_modules():
            if isinstance(_module, nn.Conv2d):
                _module.set_saf_ratio(ratio)

class SubCandidateNetPatch(SubCandidateNet):
    """
    The candidate net for SuperNet weights manager.
    """

    def __init__(self, super_net, rollout, member_mask, gpus=tuple(), cache_named_members=False,
                 virtual_parameter_only=True, eval_no_grad=True):
        super(SubCandidateNetPatch, self).__init__(super_net, rollout, member_mask, gpus, cache_named_members,
            virtual_parameter_only, eval_no_grad)

    def set_saf_ratio(self, ratio):
        for idx, _module in self.named_modules():
            if isinstance(_module, nn.Conv2d):
6 changes: 4 additions & 2 deletions examples/research/ftt-nas/run_adsaf.sh
@@ -1,8 +1,9 @@
#!/bin/bash
here=$(dirname "$0")
gpu=${GPU:-0}
weight_fault=1
exp_name=${1}
result_dir=./results/ftt_nas_adsaf/$exp_name
result_dir=$here/results/ftt_nas_adsaf/$exp_name
seed=${seed:-123}

if [[ $weight_fault -gt 0 ]]; then
@@ -13,7 +14,8 @@ if [[ $weight_fault -gt 0 ]]; then
if [[ ! -e $result_dir/awnas/data ]]; then
ln -s ~/awnas/data $result_dir/awnas/data
fi
cp ./examples/research/ftt-nas/fixed_point_plugins/fixed_point_rram_patch_bit.py $result_dir/awnas/plugins/
# cp $here/fixed_point_plugins/fixed_point_rram_patch_bit.py $result_dir/awnas/plugins/
cp $here/fixed_point_plugins/fixed_point_rram_patch_all.py $result_dir/awnas/plugins/
fi
config=${2}
shift 2
25 changes: 25 additions & 0 deletions examples/research/ftt-nas/run_adsaf_search.sh
@@ -0,0 +1,25 @@
#!/bin/bash
set -e

here=$(dirname "$0")
weight_fault=1
gpu=${GPU:-0}
cfg_file=${1}
default_exp_name=$(basename ${cfg_file})
default_exp_name=${default_exp_name%.yaml}
exp_name=${2:-${default_exp_name}}
result_dir=$here/results/ftt_nas_adsaf_search/$exp_name

if [[ $weight_fault -gt 0 ]]; then
echo "$result_dir/awnas/plugins"
if [[ -d "$result_dir/awnas/plugins" ]]; then
rm -r $result_dir/awnas/plugins
fi
mkdir -p $result_dir/awnas/plugins
if [[ ! -e $result_dir/awnas/data ]]; then
ln -s ~/awnas/data $result_dir/awnas/data
fi
cp $here/fixed_point_plugins/fixed_point_rram_patch_bit.py $result_dir/awnas/plugins/
fi

AWNAS_HOME=$result_dir/awnas/ awnas search --gpu $gpu --train-dir $result_dir/train --vis-dir results/tensorboard_new/weights/$exp_name/ --save-every 10 $cfg_file --develop
5 changes: 3 additions & 2 deletions examples/research/ftt-nas/run_mibb.sh
@@ -1,8 +1,9 @@
#!/bin/bash
here=$(dirname "$0")
fixed=${FIXED:-1}
gpu=${GPU:-0}
exp_name=${1}
result_dir=./results/ftt_nas_mibb/$exp_name
result_dir=$here/results/ftt_nas_mibb/$exp_name
seed=${seed:-123}

echo "use plugin dir: $result_dir/awnas/plugins"
@@ -16,7 +17,7 @@ fi

if [[ $fixed -gt 0 ]]; then
echo "copy fixed patch to plugin dir $result_dir/awnas/plugins/"
cp ./examples/research/ftt-nas/fixed_point_plugins/fixed_point_patch_new.py $result_dir/awnas/plugins/
cp $here/fixed_point_plugins/fixed_point_patch_new.py $result_dir/awnas/plugins/
fi
config=${2}
shift 2
30 changes: 30 additions & 0 deletions examples/research/ftt-nas/run_mibb_search.sh
@@ -0,0 +1,30 @@
#!/bin/bash
set -e

here=$(dirname "$0")
fixed=${FIXED:-1}
gpu=${GPU:-0}
cfg_file=${1}
default_exp_name=$(basename ${cfg_file})
default_exp_name=${default_exp_name%.yaml}
exp_name=${2:-${default_exp_name}}
result_dir=$here/results/ftt_nas_mibb_search/$exp_name
addi_args=${ADDI_ARGS:-""}

echo "use plugin dir: $result_dir/awnas/plugins"
if [[ -d "$result_dir/awnas/plugins" ]]; then
rm -r $result_dir/awnas/plugins
fi
mkdir -p $result_dir/awnas/plugins
if [[ ! -e $result_dir/awnas/data ]]; then
ln -s $HOME/awnas/data $result_dir/awnas/data
fi

if [[ $fixed -gt 0 ]]; then
echo "copy fixed patch to plugin dir $result_dir/awnas/plugins/"
cp $here/fixed_point_plugins/fixed_point_patch_new.py $result_dir/awnas/plugins/
fi
# For profiling only
# AWNAS_HOME=$result_dir/awnas/ python -m cProfile awnas search --gpu $gpu --train-dir $result_dir/train --vis-dir results/tensorboard/ftt_search_tcad/$exp_name/ --save-every 10 ${2} --develop ${addi_args}

AWNAS_HOME=$result_dir/awnas/ awnas search --gpu $gpu --train-dir $result_dir/train --vis-dir $result_dir/tensorboard --save-every 10 ${cfg_file} --develop ${addi_args}
