Merge pull request #140 from huawei-noah/zjj_release_1.7.0

release 1.7.0

zhangjiajin authored Sep 27, 2021
2 parents 0e0354e + 61602fd commit 1717008
Showing 302 changed files with 16,223 additions and 3,000 deletions.
9 changes: 5 additions & 4 deletions README.cn.md
@@ -9,12 +9,13 @@

---

**Vega ver1.6.1 released**
**Vega ver1.7.0 released**

- Bug Fixes
- Feature enhancements

- Evaluation time misreported in logs.
- Model description incorrectly updated while updating a record.
- Provides a release version for Ascend MindStudio.
- Provides data-parallel training capabilities with Horovod (GPU) and HCCL (NPU).
- Bug fix: the BOHB algorithm could fail to stop automatically after more than three rounds.

---

9 changes: 5 additions & 4 deletions README.md
@@ -8,12 +8,13 @@

---

**Vega ver1.6.1 released**
**Vega ver1.7.0 released**

- Bug Fixes:
- Feature enhancements:

- Evaluation time misreported in logs.
- Model description incorrectly updated while updating a record.
- Provides a release version for Ascend MindStudio.
- Provides data-parallel training capabilities with Horovod (GPU) and HCCL (NPU).
- Bug fix: the BOHB algorithm could fail to stop automatically after more than three rounds.

---

2 changes: 1 addition & 1 deletion RELEASE.md
@@ -1,4 +1,4 @@
**Vega ver1.6.1 released:**
**Vega ver1.7.0 released:**

**Introduction**

27 changes: 0 additions & 27 deletions docs/cn/developer/developer_guide.md
@@ -289,8 +289,6 @@ The trainer's main function is train_process(), which is defined as follows:
self._valid_epoch()
self.callbacks.after_epoch(epoch)
self.callbacks.after_train()
if self.distributed:
self._shutdown_distributed()
def _train_epoch(self):
if vega.is_torch_backend():
@@ -707,28 +705,3 @@ class PipeStep(object):
"""Do the main task in this pipe step."""
pass
```

## 8. Fully Train

In `Fully Train`, we support single-card training and multi-node, multi-card distributed training based on `Horovod`. `Fully Train` corresponds to the `TrainPipeStep` part of the `pipeline`.

### 8.1 Configuration

To run `Horovod` distributed training, add the configuration item `distributed` to the `trainer` section of the `TrainPipeStep` configuration and set it to `True`. If this item is absent, it defaults to False and distributed training is not used.

```yaml
fullytrain:
    pipe_step:
        type: TrainPipeStep
    trainer:
        type: trainer
        distributed: True
```

We start `Horovod` distributed training through a `shell` script; communication between nodes is already configured in the image, so developers do not need to care about how `vega` starts it internally.

### 8.2 Trainer Support for Horovod Distribution

In distributed training, compared with single-card training, the `trainer`'s network model, optimizer, and data loading need to be wrapped into distributed objects by `Horovod`.

During training, the single-card and distributed code paths are almost identical; only when the final validation metrics are computed do the metric values from different cards need to be aggregated into an overall average, as shown in the sketch below.
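
A minimal sketch of that aggregation, assuming PyTorch and `horovod.torch` (with `hvd.init()` already called by the launcher); this is illustrative, not Vega's actual trainer code:

```python
import torch
import horovod.torch as hvd

def average_metric(local_value: float, local_count: int) -> float:
    """Average a per-card metric over all workers, weighted by sample count."""
    # Sum value*count and the counts separately across workers, then divide.
    total = hvd.allreduce(torch.tensor(local_value * local_count), op=hvd.Sum)
    count = hvd.allreduce(torch.tensor(float(local_count)), op=hvd.Sum)
    return (total / count).item()
```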
1 change: 0 additions & 1 deletion docs/cn/developer/quick_start.md
@@ -171,7 +171,6 @@ nas:
type: accuracy
epochs: 3
save_steps: 250
distributed: False
num_class: 10
dataset:
type: Cifar10
37 changes: 27 additions & 10 deletions docs/cn/user/config_reference.md
@@ -81,20 +81,13 @@ general:
## 2.1 Parallel and distributed
The configuration items related to distributed deployment are general.parallel_search, general.parallel_fully_train, and trainer.distributed. If multiple GPUs|NPUs are available, choose a suitable parallel or distributed setting as required.
During the NAS/HPO search, one trainer generally corresponds to one GPU/NPU. If one trainer needs multiple GPUs/NPUs, modify the `general.devices_per_trainer` parameter.

| general.parallel_search or<br>general.parallel_fully_train | general.devices_per_trainer | trainer.distributed | Distributed and parallel mode |
| :--: | :--: | :--: | :-- |
| False | 1 | False | (default) Serial search and training with one card |
| False | >1 | False | Serial search and training with multiple cards |
| False | >=1 (number of cards assigned to each model) | True | Training with Horovod/HCCL |
| True | 1 | Any value | Parallel search and training, one card per model |
| True | >1 (number of cards assigned to each model) | Any value | Parallel search and training, multiple cards per model |

For example, the following configuration trains each model with 2 cards during the search phase and uses Horovod in the fully train phase.
Currently this configuration supports only the PyTorch/GPU scenario, as shown below.

```yaml
general:
    backend: pytorch
    parallel_search: True
    parallel_fully_train: False
    devices_per_trainer: 2
@@ -143,6 +136,30 @@ fully_train:
type: Cifar10
```

In the fully train phase, Horovod (GPU) or HCCL (NPU) can be used for data-parallel distributed training.

For example:

```yaml
pipeline: [fully_train]
fully_train:
    pipe_step:
        type: HorovodTrainStep    # HorovodTrainStep(GPU), HcclTrainStep(NPU)
    trainer:
        epochs: 160
    model:
        model_desc:
            modules: ['backbone']
            backbone:
                type: ResNet
                num_class: 10
    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
```

## 3. NAS and HPO configuration items

The HPO / NAS configuration has the following main parts:
41 changes: 41 additions & 0 deletions docs/cn/user/security_configure.md
@@ -0,0 +1,41 @@
# vega security configuration

## Evaluate server
### Evaluate server https security configuration
To be added.
### Other security configuration suggestions for the evaluate server
#### Configure a whitelist on the evaluate server so that only trusted servers can connect
1. Linux whitelist configuration
    * Configure the whitelist:
    ```
    sudo iptables -I INPUT -p tcp --dport <evaluation-port> -j DROP
    sudo iptables -I INPUT -s <whitelist-IP-1> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-2> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-3> -p tcp --dport <evaluation-port> -j ACCEPT
    sudo iptables -I INPUT -s <whitelist-IP-4> -p tcp --dport <evaluation-port> -j ACCEPT
    ```
    * To remove an entry from the whitelist:
        1. List the rules: ```sudo iptables -L -n --line-numbers```
        2. Delete the rule: ```sudo iptables -D INPUT <line-number-from-the-query>```
2. Whitelist in the configuration file `.vega/vega.ini`
    * Set the whitelist in limit.white_list, separated by commas
    ```ini
    [limit]
    white_list=127.0.0.1,10.174.183.95
    ```
#### Configure the request frequency on the evaluate server
Set the allowed request frequency in the configuration file `.vega/vega.ini`; by default, at most 100 requests per minute are allowed.
```ini
[limit]
request_frequency_limit=5/minute # allow at most 5 requests per minute
```

#### Configure the request size limit on the evaluate server
Set the request size limit in the configuration file `.vega/vega.ini` to control the size of uploaded files; the default is 1G.
```ini
[limit]
max_content_length=100000 # limit requests to about 100 KB
```
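
As a quick local check, the `[limit]` values can be read back with Python's standard `configparser`. This is only an illustrative sketch: the file path and the defaults (100/minute, 1G) follow the text above, and nothing here is part of vega itself.

```python
import configparser
import os

# The examples above use inline "#" comments, so strip them while parsing.
cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
cfg.read(os.path.expanduser("~/.vega/vega.ini"))

if cfg.has_section("limit"):
    limit = cfg["limit"]
    print("white_list:", limit.get("white_list", "").split(","))
    print("request_frequency_limit:", limit.get("request_frequency_limit", "100/minute"))
    print("max_content_length:", limit.getint("max_content_length", 1000000000))  # 1G default
```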

27 changes: 0 additions & 27 deletions docs/en/developer/developer_guide.md
@@ -293,8 +293,6 @@ The standard trainer training process is implemented in the train_process interface, which is defined as follows:
self._valid_epoch()
self.callbacks.after_epoch(epoch)
self.callbacks.after_train()
if self.distributed:
self._shutdown_distributed()

def _train_epoch(self):
if vega.is_torch_backend():
@@ -712,28 +710,3 @@ class PipeStep(object):
"""Do the main task in this pipe step."""
pass
```

## 8. Fully Train

In `Fully Train`, we support single-card training and multi-node, multi-card distributed training based on `Horovod`. `Fully Train` corresponds to `TrainPipeStep` in the `pipeline`.

### 8.1 Configuration

If you need to perform `Horovod` distributed training, add the configuration item `distributed` to the `trainer` section of the `TrainPipeStep` configuration and set it to `True`. If this item is absent, it defaults to False and distributed training is not used.

```yaml
fullytrain:
    pipe_step:
        type: TrainPipeStep
    trainer:
        type: trainer
        distributed: True
```

A `shell` script is used to start `Horovod` distributed training. Communication between nodes is already configured in the image, so developers do not need to care about how `vega` starts it internally.

### 8.2 Trainer Support for Distributed Horovod

In distributed training, compared with single-card training, the `trainer`'s network model, optimizer, and data loading need to be wrapped into distributed objects by `Horovod`.

During training, the single-card and distributed code paths are almost identical. However, when the final validation metrics are computed, the metric values from the different cards must be aggregated into an overall average, as in the sketch below.
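
The sketch below shows the kind of wrapping involved, assuming PyTorch and `horovod.torch`; the toy model and dataset are stand-ins, and this is illustrative rather than Vega's actual trainer code:

```python
import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Toy stand-ins for the searched network and the real dataset.
model = torch.nn.Linear(32, 10)
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# Shard the data so each card sees a distinct subset.
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Wrap the optimizer so gradients are averaged across workers, and start
# every worker from identical weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# After validation, a per-card metric is averaged over all workers.
local_metric = torch.tensor(0.0)  # placeholder for this card's metric value
global_metric = hvd.allreduce(local_metric, op=hvd.Average).item()
```

Such a script would typically be launched with `horovodrun -np <cards> python train.py`, one process per card.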
36 changes: 26 additions & 10 deletions docs/en/user/config_reference.md
@@ -80,17 +80,9 @@ general:
## 2.1 Parallel and distributed
If there are multiple GPUs|NPUs in the running environment, select a proper parallel or distributed configuration as required. The configuration items related to distributed deployment are general.parallel_search, general.parallel_fully_train, and trainer.distributed.
During the NAS/HPO search, one trainer generally corresponds to one GPU/NPU. If one trainer needs multiple GPUs/NPUs, modify the `general.devices_per_trainer` parameter.

| general.parallel_search or<br>general.parallel_fully_train | general.devices_per_trainer | trainer.distributed | Distributed and parallel modes |
| :--: | :--: | :--: | :-- |
| False | 1 | False | (default) Serial search and training with one card |
| False | >1 | False | Serial search and training with multiple cards |
| False | >=1<br>(Number of cards assigned to each model) | True | Training with Horovod/HCCL |
| True | 1 | Any value | Parallel search and training with one card per model |
| True | >1<br>(Number of cards assigned to each model) | Any value | Parallel search and training with multiple cards per model |
For example, the following configuration trains each model with 2 cards during the search phase and uses Horovod in the fully train phase.
Currently this configuration supports only the PyTorch/GPU scenario, as shown below.

```yaml
general:
@@ -142,6 +134,30 @@ fully_train:
type: Cifar10
```

In the fully train phase, Horovod (GPU) or HCCL (NPU) can be used for data-parallel distributed training.

For example:

```yaml
pipeline: [fully_train]
fully_train:
    pipe_step:
        type: HorovodTrainStep    # HorovodTrainStep(GPU), HcclTrainStep(NPU)
    trainer:
        epochs: 160
    model:
        model_desc:
            modules: ['backbone']
            backbone:
                type: ResNet
                num_class: 10
    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
```

## 3. NAS and HPO configuration items

HPO and NAS configuration items include:
3 changes: 2 additions & 1 deletion evaluate_service/hardwares/davinci/davinci.py
@@ -39,10 +39,11 @@ def convert_model(self, backend, model, weight, **kwargs):
"""
om_save_path = kwargs["save_dir"]
input_shape = kwargs["input_shape"]
precision = kwargs['precision']
log_save_path = os.path.dirname(model)

command_line = ["bash", self.current_path + "/model_convert.sh", self.davinci_environment_type, backend, model,
weight, om_save_path, log_save_path, input_shape]
weight, om_save_path, log_save_path, input_shape, precision]
try:
subprocess.check_output(command_line)
except subprocess.CalledProcessError as exc:
7 changes: 4 additions & 3 deletions evaluate_service/hardwares/davinci/model_convert.sh
@@ -5,6 +5,7 @@ WEIGHT=$4
OM_SAVE_PATH=$5
LOG_SAVE_PATH=$6
INPUT_SHAPE=$7
PRECISION=$8

if [ $DAVINCI_ENV_TYPE == "ATLAS200DK" ]; then
if [ $BACKEND == "tensorflow" ]; then
@@ -16,13 +17,13 @@ if [ $DAVINCI_ENV_TYPE == "ATLAS200DK" ]; then
fi
else
if [ $BACKEND == "tensorflow" ]; then
atc --model=$MODEL --framework=3 --input_format='NCHW' --disable_reuse_memory=1 --input_shape=$INPUT_SHAPE --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=3 --input_format='NCHW' --disable_reuse_memory=1 --input_shape=$INPUT_SHAPE --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "caffe" ]; then
atc --model=$MODEL --weight=$WEIGHT --framework=0 --input_format='NCHW' --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "mindspore" ]; then
atc --model=$MODEL --framework=1 --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=1 --disable_reuse_memory=1 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
elif [ $BACKEND == "onnx" ]; then
atc --model=$MODEL --framework=5 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore >$LOG_SAVE_PATH/omg.log 2>&1
atc --model=$MODEL --framework=5 --output=$OM_SAVE_PATH/davinci_model --soc_version=Ascend310 --core_type=AiCore --output_type=$PRECISION >$LOG_SAVE_PATH/omg.log 2>&1
else
echo "[ERROR] Davinci model convert: The backend must be tensorflow, caffe, mindspore or onnx."
fi
10 changes: 7 additions & 3 deletions evaluate_service/main.py
@@ -42,6 +42,7 @@
import traceback
import argparse


app = Flask(__name__)
api = Api(app)

@@ -50,7 +51,7 @@ class Evaluate(Resource):
"""Evaluate Service for service."""

def __init__(self):
self.result = {"latency": "9999", "out_data": [], "status": "sucess", "timestamp": ""}
self.result = {"latency": "9999", "out_data": [], "status": "sucess", "timestamp": "", "error_message": ""}

@classmethod
def _add_params(cls, work_path, optional_params):
@@ -70,9 +71,10 @@ def post(self):
try:
self.hardware_instance.convert_model(backend=self.backend, model=self.model, weight=self.weight,
save_dir=self.share_dir, input_shape=self.input_shape,
out_nodes=self.out_nodes)
out_nodes=self.out_nodes, precision=self.precision)
except Exception:
self.result["status"] = "Model convert failed."
self.result["error_message"] = traceback.format_exc()
logging.error("[ERROR] Model convert failed!")
traceback.print_exc()
try:
@@ -85,6 +87,7 @@ def post(self):
self.result["out_data"] = output
except Exception:
self.result["status"] = "Inference failed."
self.result["error_message"] = traceback.format_exc()
logging.error("[ERROR] Inference failed! ")
traceback.print_exc()

@@ -99,6 +102,7 @@ def parse_paras(self):
self.input_shape = request.form.get("input_shape", type=str, default="")
self.out_nodes = request.form.get("out_nodes", type=str, default="")
self.repeat_times = int(request.form.get("repeat_times"))
self.precision = request.form.get("precision", type=str, default="FP32")

def upload_files(self):
"""Upload the files from the client to the service."""
@@ -151,7 +155,7 @@ def _parse_args():
parser.add_argument("-w", "--work_path", type=str, required=True, help="the work dir to save the file")
parser.add_argument("-t", "--davinci_environment_type", type=str, required=False, default="ATLAS300",
help="the type the davinci hardwares")
parser.add_argument("-c", "--clean_interval", type=int, required=False, default=1 * 24 * 3600,
parser.add_argument("-c", "--clean_interval", type=int, required=False, default=1 * 6 * 3600,
help="the time interval to clean the temp folder")
parser.add_argument("-u", "--ddk_user_name", type=str, required=False, default="user",
help="the user to acess ATLAS200200 DK")
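For reference, a hypothetical client call exercising the new `precision` field might look like the sketch below. The host, port, and route are assumptions, the file-upload fields are omitted, and only the form fields and result keys visible in the code above come from the source:

```python
import requests

# Endpoint is an assumption; the form fields mirror parse_paras() above,
# and "backend" is assumed to be parsed the same way elsewhere in main.py.
resp = requests.post(
    "http://127.0.0.1:8888/",
    data={
        "backend": "tensorflow",
        "input_shape": "images:1,3,224,224",  # example value, format assumed
        "out_nodes": "",
        "repeat_times": 10,
        "precision": "FP16",                  # new field added in this commit
    },
)
result = resp.json()
print(result["status"], result.get("error_message", ""))
```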
2 changes: 1 addition & 1 deletion examples/compression/prune_ea/prune_finetune_ms.yml
@@ -12,7 +12,7 @@ fine_tune:
type: ResNetMs
resnet_size: 50
num_classes: 10
need_adjust: True
need_adjust: True
pretrained_model_file: "/cache/models/resnet50-19c8e357.pth"
trainer:
type: Trainer
3 changes: 0 additions & 3 deletions examples/data_augmentation/cyclesr/cyclesr.yml
@@ -24,7 +24,6 @@ fully_train:
save_in_memory: False
pin_memory: False
shuffle: True
distributed: False
imgs_per_gpu: 4
drop_last: True
test:
@@ -34,7 +33,6 @@
num_workers: 8
shuffle: False
pin_memory: False
distributed: False
imgs_per_gpu: 4
val_ps_offset: 10
drop_last: False
@@ -51,7 +49,6 @@
val_ps_offset: 10
continue_train: !!null
lr_policy: linear
distributed: False
model_desc:
modules: ["custom"]
custom:
