Merge pull request #140 from huawei-noah/zjj_release_1.7.0

release 1.7.0

zhangjiajin authored Sep 27, 2021
2 parents 0e0354e + 61602fd commit 1717008
Showing 302 changed files with 16,223 additions and 3,000 deletions.
9 changes: 5 additions & 4 deletions README.cn.md
@@ -9,12 +9,13 @@

---

**Vega ver1.6.1 released**
**Vega ver1.7.0 released**

- Bug Fixes
- Feature enhancements

- Evaluation time misreported in logs.
- Model description incorrectly updated while updating a record.
- Provides a release version for Ascend MindStudio.
- Provides data-parallel training capabilities with Horovod (GPU) and HCCL (NPU).
- Bug fix: the BOHB algorithm could fail to stop automatically after more than three rounds.

---

9 changes: 5 additions & 4 deletions README.md
@@ -8,12 +8,13 @@

---

**Vega ver1.6.1 released**
**Vega ver1.7.0 released**

- Bug Fixes:
- Feature enhancements:

- Evaluation time misreported in logs.
- Model description incorrectly updated while updating a record.
- Provides a release version for Ascend MindStudio.
- Provides data-parallel training capabilities with Horovod (GPU) and HCCL (NPU).
- Bug fix: the BOHB algorithm could fail to stop automatically after more than three rounds.

---

2 changes: 1 addition & 1 deletion RELEASE.md
@@ -1,4 +1,4 @@
**Vega ver1.6.1 released:**
**Vega ver1.7.0 released:**

**Introduction**

27 changes: 0 additions & 27 deletions docs/cn/developer/developer_guide.md
@@ -289,8 +289,6 @@ The trainer's main function is train_process(), which is defined as follows:
self._valid_epoch()
self.callbacks.after_epoch(epoch)
self.callbacks.after_train()
if self.distributed:
self._shutdown_distributed()
def _train_epoch(self):
if vega.is_torch_backend():
@@ -707,28 +705,3 @@ class PipeStep(object):
"""Do the main task in this pipe step."""
pass
```

## 8. Fully Train

In `Fully Train`, we support single-card training and multi-node, multi-card distributed training based on `Horovod`. `Fully Train` corresponds to the `TrainPipeStep` part of the `pipeline`.

### 8.1 Configuration

To run `Horovod` distributed training, add the configuration item `distributed` to the `trainer` section of the `TrainPipeStep` configuration and set it to `True`. If this item is absent, it defaults to False and distributed training is not used.

```yaml
fullytrain:
    pipe_step:
        type: TrainPipeStep
    trainer:
        type: trainer
        distributed: True
```

We start `Horovod` distributed training through a `shell` script; communication between nodes is already configured in the image, so developers do not need to care about how `vega` starts it internally.

### 8.2 Trainer Support for Horovod Distribution

In distributed training, compared with single-card training, the `trainer`'s network model, optimizer, and data loading need to be wrapped into distributed objects by `Horovod`.

During training, the single-card and distributed code paths are almost identical; only when the final validation metrics are computed do the metric values from different cards need to be aggregated into an overall average, as shown in the sketch below.
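
A minimal sketch of that aggregation, assuming PyTorch and `horovod.torch` (with `hvd.init()` already called by the launcher); this is illustrative, not Vega's actual trainer code:

```python
import torch
import horovod.torch as hvd

def average_metric(local_value: float, local_count: int) -> float:
    """Average a per-card metric over all workers, weighted by sample count."""
    # Sum value*count and the counts separately across workers, then divide.
    total = hvd.allreduce(torch.tensor(local_value * local_count), op=hvd.Sum)
    count = hvd.allreduce(torch.tensor(float(local_count)), op=hvd.Sum)
    return (total / count).item()
```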
1 change: 0 additions & 1 deletion docs/cn/developer/quick_start.md
@@ -171,7 +171,6 @@ nas:
type: accuracy
epochs: 3
save_steps: 250
distributed: False
num_class: 10
dataset:
type: Cifar10
37 changes: 27 additions & 10 deletions docs/cn/user/config_reference.md
@@ -81,20 +81,13 @@ general:
## 2.1 Parallel and distributed
The configuration items related to distributed deployment are general.parallel_search, general.parallel_fully_train, and trainer.distributed. If multiple GPUs|NPUs are available, choose a suitable parallel or distributed setting as required.
During the NAS/HPO search, one trainer generally corresponds to one GPU/NPU. If one trainer needs multiple GPUs/NPUs, modify the `general.devices_per_trainer` parameter.

| general.parallel_search or<br>general.parallel_fully_train | general.devices_per_trainer | trainer.distributed | Distributed and parallel mode |
| :--: | :--: | :--: | :-- |
| False | 1 | False | (default) Serial search and training with one card |
| False | >1 | False | Serial search and training with multiple cards |
| False | >=1 (number of cards assigned to each model) | True | Training with Horovod/HCCL |
| True | 1 | Any value | Parallel search and training, one card per model |
| True | >1 (number of cards assigned to each model) | Any value | Parallel search and training, multiple cards per model |

For example, the following configuration trains each model with 2 cards during the search phase and uses Horovod in the fully train phase.
Currently this configuration supports only the PyTorch/GPU scenario, as shown below.

```yaml
general:
    backend: pytorch
    parallel_search: True
    parallel_fully_train: False
    devices_per_trainer: 2
@@ -143,6 +136,30 @@ fully_train:
type: Cifar10
```

In the fully train phase, Horovod (GPU) or HCCL (NPU) can be used for data-parallel distributed training.

For example:

```yaml
pipeline: [fully_train]
fully_train:
    pipe_step:
        type: HorovodTrainStep    # HorovodTrainStep(GPU), HcclTrainStep(NPU)
    trainer:
        epochs: 160
    model:
        model_desc:
            modules: ['backbone']
            backbone:
                type: ResNet
                num_class: 10
    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
```

## 3. NAS and HPO configuration items

The HPO / NAS configuration has the following main parts:
41 changes: 41 additions & 0 deletions docs/cn/user/security_configure.md
@@ -0,0 +1,41 @@
# vega security configuration

## Evaluate server
### Evaluate server https security configuration
To be added.
### Other security configuration suggestions for the evaluate server
#### Configure a whitelist on the evaluate server so that only trusted servers can connect
1. Linux whitelist configuration
    * Configure the whitelist:
    ```
    sudo iptables -I INPUT -p tcp --dport <evaluation-port> -j DROP
    sudo iptables -I INPUT -s <whitelist-IP-1> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-2> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-3> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-4> -p tcp --dport <evaluation-port> -j ACCEPT
    ```
    * To remove an entry from the whitelist:
        1. List the rules: ```sudo iptables -L -n --line-numbers```
        2. Delete the rule: ```sudo iptables -D INPUT <line-number-from-the-query>```
2. Whitelist in the configuration file `.vega/vega.ini`
    * Set the whitelist in limit.white_list, separated by commas
    ```ini
    [limit]
    white_list=127.0.0.1,10.174.183.95
    ```
#### Configure the request frequency on the evaluate server
Set the allowed request frequency in the configuration file `.vega/vega.ini`; by default, at most 100 requests per minute are allowed.
```ini
[limit]
request_frequency_limit=5/minute # allow at most 5 requests per minute
```

#### Configure the request size limit on the evaluate server
Set the request size limit in the configuration file `.vega/vega.ini` to control the size of uploaded files; the default is 1G.
```ini
[limit]
max_content_length=100000 # limit requests to about 100 KB
```
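
As a quick local check, the `[limit]` values can be read back with Python's standard `configparser`. This is only an illustrative sketch: the file path and the defaults (100/minute, 1G) follow the text above, and nothing here is part of vega itself.

```python
import configparser
import os

# The examples above use inline "#" comments, so strip them while parsing.
cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
cfg.read(os.path.expanduser("~/.vega/vega.ini"))

if cfg.has_section("limit"):
    limit = cfg["limit"]
    print("white_list:", limit.get("white_list", "").split(","))
    print("request_frequency_limit:", limit.get("request_frequency_limit", "100/minute"))
    print("max_content_length:", limit.getint("max_content_length", 1000000000))  # 1G default
```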

27 changes: 0 additions & 27 deletions docs/en/developer/developer_guide.md
@@ -293,8 +293,6 @@ The standard trainer training process is implemented in the train_process interface, which is defined as follows:
self._valid_epoch()
self.callbacks.after_epoch(epoch)
self.callbacks.after_train()
if self.distributed:
self._shutdown_distributed()

def _train_epoch(self):
if vega.is_torch_backend():
@@ -712,28 +710,3 @@ class PipeStep(object):
"""Do the main task in this pipe step."""
pass
```

## 8. Fully Train

In `Fully Train`, we support single-card training and multi-node, multi-card distributed training based on `Horovod`. `Fully Train` corresponds to `TrainPipeStep` in the `pipeline`.

### 8.1 Configuration

If you need to perform `Horovod` distributed training, add the configuration item `distributed` to the `trainer` section of the `TrainPipeStep` configuration and set it to `True`. If this item is absent, it defaults to False and distributed training is not used.

```yaml
fullytrain:
    pipe_step:
        type: TrainPipeStep
    trainer:
        type: trainer
        distributed: True
```

A `shell` script is used to start `Horovod` distributed training. Communication between nodes is already configured in the image, so developers do not need to care about how `vega` starts it internally.

### 8.2 Trainer Support for Distributed Horovod

In distributed training, compared with single-card training, the `trainer`'s network model, optimizer, and data loading need to be wrapped into distributed objects by `Horovod`.

During training, the single-card and distributed code paths are almost identical. However, when the final validation metrics are computed, the metric values from the different cards must be aggregated into an overall average, as in the sketch below.
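
The sketch below shows the kind of wrapping involved, assuming PyTorch and `horovod.torch`; the toy model and dataset are stand-ins, and this is illustrative rather than Vega's actual trainer code:

```python
import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Toy stand-ins for the searched network and the real dataset.
model = torch.nn.Linear(32, 10)
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# Shard the data so each card sees a distinct subset.
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Wrap the optimizer so gradients are averaged across workers, and start
# every worker from identical weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# After validation, a per-card metric is averaged over all workers.
local_metric = torch.tensor(0.0)  # placeholder for this card's metric value
global_metric = hvd.allreduce(local_metric, op=hvd.Average).item()
```

Such a script would typically be launched with `horovodrun -np <cards> python train.py`, one process per card.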
36 changes: 26 additions & 10 deletions docs/en/user/config_reference.md
@@ -80,17 +80,9 @@ general:
## 2.1 Parallel and distributed
If there are multiple GPUs|NPUs in the running environment, select a proper parallel or distributed configuration as required. The configuration items related to distributed deployment are general.parallel_search, general.parallel_fully_train, and trainer.distributed.
During the NAS/HPO search, one trainer generally corresponds to one GPU/NPU. If one trainer needs multiple GPUs/NPUs, modify the `general.devices_per_trainer` parameter.

| general.parallel_search or<br>general.parallel_fully_train | general.devices_per_trainer | trainer.distributed | Distributed and parallel modes |
| :--: | :--: | :--: | :-- |
| False | 1 | False | (default) Serial search and training with one card |
| False | >1 | False | Serial search and training with multiple cards |
| False | >=1<br>(Number of cards assigned to each model) | True | Training with Horovod/HCCL |
| True | 1 | Any value | Parallel search and training with one card per model |
| True | >1<br>(Number of cards assigned to each model) | Any value | Parallel search and training with multiple cards per model |
For example, the following configuration trains each model with 2 cards during the search phase and uses Horovod in the fully train phase.
Currently this configuration supports only the PyTorch/GPU scenario, as shown below.

```yaml
general:
@@ -142,6 +134,30 @@ fully_train:
type: Cifar10
```

In the fully train phase, Horovod (GPU) or HCCL (NPU) can be used for data-parallel distributed training.

For example:

```yaml
pipeline: [fully_train]
fully_train:
    pipe_step:
        type: HorovodTrainStep    # HorovodTrainStep(GPU), HcclTrainStep(NPU)
    trainer:
        epochs: 160
    model:
        model_desc:
            modules: ['backbone']
            backbone:
                type: ResNet
                num_class: 10
    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
```

## 3. NAS and HPO configuration items

HPO and NAS configuration items include:
3 changes: 2 additions & 1 deletion evaluate_service/hardwares/davinci/davinci.py
@@ -39,10 +39,11 @@ def convert_model(self, backend, model, weight, **kwargs):
"""
om_save_path = kwargs["save_dir"]
input_shape = kwargs["input_shape"]
precision = kwargs['precision']
log_save_path = os.path.dirname(model)

command_line = ["bash", self.current_path + "/model_convert.sh", self.davinci_environment_type, backend, model,
weight, om_save_path, log_save_path, input_shape]
weight, om_save_path, log_save_path, input_shape, precision]
try:
subprocess.check_output(command_line)
except subprocess.CalledProcessError as exc:
7 changes: 4 additions & 3 deletions evaluate_service/hardwares/davinci/model_convert.sh
@@ -5,6 +5,7 @@ WEIGHT=$4
OM_SAVE_PATH=$5
LOG_SAVE_PATH=$6
INPUT_SHAPE=$7
PRECISION=$8

if [ $DAVINCI_ENV_TYPE == "ATLAS200DK" ]; then
if [ $BACKEND == "tensorflow" ]; then
@@ -16,13 +17,13 @@ if [ $DAVINCI_ENV_TYPE == "ATLAS200DK" ]; then
fi
else
if [ $BACKEND == "tensorflow" ]; then
atc --model=$MODEL --framework=3 --input_format='NCHW' --disable_reuse_memory=1 --input_shape=$INPUT_SHAPE --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=3 --input_format='NCHW' --disable_reuse_memory=1 --input_shape=$INPUT_SHAPE --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "caffe" ]; then
atc --model=$MODEL --weight=$WEIGHT --framework=0 --input_format='NCHW' --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "mindspore" ]; then
atc --model=$MODEL --framework=1 --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=1 --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "onnx" ]; then
atc --model=$MODEL --framework=5 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=5 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
else
echo "[ERROR] Davinci model convert: The backend must be tensorflow, caffe, mindspore or onnx."
fi
10 changes: 7 additions & 3 deletions evaluate_service/main.py
@@ -42,6 +42,7 @@
import traceback
import argparse


app = Flask(__name__)
api = Api(app)

@@ -50,7 +51,7 @@ class Evaluate(Resource):
"""Evaluate Service for service."""

def __init__(self):
self.result = {"latency": "9999", "out_data": [], "status": "sucess", "timestamp": ""}
self.result = {"latency": "9999", "out_data": [], "status": "sucess", "timestamp": "", "error_message": ""}

@classmethod
def _add_params(cls, work_path, optional_params):
@@ -70,9 +71,10 @@ def post(self):
try:
self.hardware_instance.convert_model(backend=self.backend, model=self.model, weight=self.weight,
save_dir=self.share_dir, input_shape=self.input_shape,
out_nodes=self.out_nodes)
out_nodes=self.out_nodes, precision=self.precision)
except Exception:
self.result["status"] = "Model convert failed."
self.result["error_message"] = traceback.format_exc()
logging.error("[ERROR] Model convert failed!")
traceback.print_exc()
try:
@@ -85,6 +87,7 @@ def post(self):
self.result["out_data"] = output
except Exception:
self.result["status"] = "Inference failed."
self.result["error_message"] = traceback.format_exc()
logging.error("[ERROR] Inference failed! ")
traceback.print_exc()

@@ -99,6 +102,7 @@ def parse_paras(self):
self.input_shape = request.form.get("input_shape", type=str, default="")
self.out_nodes = request.form.get("out_nodes", type=str, default="")
self.repeat_times = int(request.form.get("repeat_times"))
self.precision = request.form.get("precision", type=str, default="FP32")

def upload_files(self):
"""Upload the files from the client to the service."""
@@ -151,7 +155,7 @@ def _parse_args():
parser.add_argument("-w", "--work_path", type=str, required=True, help="the work dir to save the file")
parser.add_argument("-t", "--davinci_environment_type", type=str, required=False, default="ATLAS300",
help="the type the davinci hardwares")
parser.add_argument("-c", "--clean_interval", type=int, required=False, default=1 * 24 * 3600,
parser.add_argument("-c", "--clean_interval", type=int, required=False, default=1 * 6 * 3600,
help="the time interval to clean the temp folder")
parser.add_argument("-u", "--ddk_user_name", type=str, required=False, default="user",
help="the user to acess ATLAS200200 DK")
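For reference, a hypothetical client call exercising the new `precision` field might look like the sketch below. The host, port, and route are assumptions, the file-upload fields are omitted, and only the form fields and result keys visible in the code above come from the source:

```python
import requests

# Endpoint is an assumption; the form fields mirror parse_paras() above,
# and "backend" is assumed to be parsed the same way elsewhere in main.py.
resp = requests.post(
    "http://127.0.0.1:8888/",
    data={
        "backend": "tensorflow",
        "input_shape": "images:1,3,224,224",  # example value, format assumed
        "out_nodes": "",
        "repeat_times": 10,
        "precision": "FP16",                  # new field added in this commit
    },
)
result = resp.json()
print(result["status"], result.get("error_message", ""))
```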
2 changes: 1 addition & 1 deletion examples/compression/prune_ea/prune_finetune_ms.yml
@@ -12,7 +12,7 @@ fine_tune:
type: ResNetMs
resnet_size: 50
num_classes: 10
need_adjust: True
need_adjust: True
pretrained_model_file: "/cache/models/resnet50-19c8e357.pth"
trainer:
type: Trainer
3 changes: 0 additions & 3 deletions examples/data_augmentation/cyclesr/cyclesr.yml
@@ -24,7 +24,6 @@ fully_train:
save_in_memory: False
pin_memory: False
shuffle: True
distributed: False
imgs_per_gpu: 4
drop_last: True
test:
@@ -34,7 +33,6 @@
num_workers: 8
shuffle: False
pin_memory: False
distributed: False
imgs_per_gpu: 4
val_ps_offset: 10
drop_last: False
@@ -51,7 +49,6 @@
val_ps_offset: 10
continue_train: !!null
lr_policy: linear
distributed: False
model_desc:
modules: ["custom"]
custom:
