Bitfusion on Kubernetes

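Current GPU virtualization solutions have several shortcomings:

GPU compute capacity is underutilized
GPU resources cannot be isolated well, or the isolation granularity cannot be adjusted dynamically
Only local GPU resources can be used
Applications are difficult to schedule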

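Bitfusion addresses these problems by providing a pool of remote GPUs.
Bitfusion treats GPUs as first-class citizens, so that GPUs can be abstracted, partitioned, automated, and shared like compute resources. Meanwhile, Kubernetes has become the de facto platform for deploying and managing machine learning workloads.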

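However, Kubernetes offers no native way to use Bitfusion's remote GPU pool. This limitation is the key obstacle for Kubernetes jobs that need Bitfusion GPUs. Kubernetes needs a friendly way to consume Bitfusion GPU resources, one that:

Supports resource management
Supports GPU pool management

This project addresses these problems by enabling Kubernetes to use Bitfusion.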

Architecture

Two components enable Kubernetes to use Bitfusion:

bitfusion-device-plugin
bitfusion-webhook

Each component is built into its own Docker image. bitfusion-device-plugin runs as a DaemonSet on every worker node where kubelet runs. bitfusion-webhook runs as a Deployment on the Kubernetes master node.

Prerequisites

The installation machine runs Ubuntu Linux
OpenSSL is installed on the Ubuntu machine
Kubernetes 1.17+
Bitfusion 2.5+
The kubectl and docker commands are available and working
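
The prerequisites above can be checked from a shell before installation. The following is a minimal sketch and not part of the project: the helper names require_cmds and version_ge are hypothetical, and version_ge assumes GNU sort (the default on Ubuntu), whose -V option compares version strings. It expects plain version numbers such as 1.20.4, without a leading "v".

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of the Bitfusion project.

# require_cmds CMD...: succeed only if every named command is on PATH.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# version_ge ACTUAL MINIMUM: succeed if ACTUAL >= MINIMUM.
# Assumes GNU sort, whose -V option orders version strings.
version_ge() {
    lowest=$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example usage (commented out so the file can be sourced safely):
# require_cmds kubectl docker openssl tar || exit 1
# version_ge 1.20.4 1.17 && echo "Kubernetes version is new enough"
```

The version numbers passed to version_ge have to be obtained separately, for example by reading the output of kubectl version by hand.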

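Get the Bitfusion token file

To enable Bitfusion, you must generate a Bitfusion token for authorization and download the corresponding tar file to the installation machine. Follow these steps to get the token from vCenter:
Step 1. Log in to vCenter
Step 2. Click Bitfusion in the Plugins section

Step 3. Select the Tokens tab, then select the token to download

Step 4. Make sure the token is enabled, then click the DOWNLOAD button

If no token is available in the list, click NEW TOKEN to create one. For more details, see: https://docs.vmware.com/en/VMware-vSphere-Bitfusion/2.5/Install-Guide/GUID-361A9C59-BB22-4BF0-9B05-8E80DE70BE5B.html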

Create a Kubernetes secret from the Bitfusion token

Upload the Bitfusion token file to the installation machine. Extract it with the following commands:

$ mkdir tokens   
$ tar -xvf ./2BgkZdN.tar -C tokens

The tokens/ directory now contains three files: ca.crt, client.yaml, and servers.conf:

tokens
├── ca.crt
├── client.yaml
└── servers.conf
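
Before creating the secret, it can be useful to confirm that the extracted directory really contains the three files listed above. The helper below is a hypothetical sketch, not part of the project; it only checks that the files exist and does not validate their contents.

```shell
#!/bin/sh
# Hypothetical helper; not part of the Bitfusion project.

# check_token_dir DIR: succeed only if DIR contains the three files
# that the extracted token archive is expected to provide.
check_token_dir() {
    dir=$1
    status=0
    for f in ca.crt client.yaml servers.conf; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage (commented out so the file can be sourced safely):
# check_token_dir tokens && echo "token files look complete"
```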

Then create a secret in the kube-system namespace of Kubernetes with the following command:

$ kubectl create secret generic bitfusion-secret --from-file=tokens -n kube-system

For more information about kubectl, see: https://kubernetes.io/docs/reference/kubectl/overview/

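Quick start

There are two deployment options; choose either one:

Deploy with the pre-built container images
Build the container images from source and deploy with them

Option 1: Deploy with the pre-built container images (recommended)

Clone the source code with the following command: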

$ git clone https://github.com/vmware/bitfusion-with-kubernetes-integration.git

Deploy the Bitfusion device plugin and the related components with the following commands. The Kubernetes cluster must be able to reach the Internet.

$ cd bitfusion-with-kubernetes-integration/bitfusion_device_plugin
$ make deploy

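Option 2: Build the container images from source and deploy with them

Instead of using the pre-built container images, you can build them from source. Once built, the images can also be pushed to an image registry (Docker Hub or a private registry).

Clone the source code with the following command: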

$ git clone https://github.com/vmware/bitfusion-with-kubernetes-integration.git

Before starting the build, edit the values of some variables in the Makefile:

$ cd bitfusion-with-kubernetes-integration/bitfusion_device_plugin
$ vim Makefile

Most variables can keep their default values. If you plan to push the images to a registry, make sure IMAGE_REPO is set to the address of that registry (the default is docker.io/bitfusiondeviceplugin):

# Variables below are the configuration of Docker images and repo for this project.
# Update these variable values with your own configuration if necessary.

IMAGE_REPO ?= docker.io/bitfusiondeviceplugin
DEVICE_IMAGE_NAME ?= bitfusion-device-plugin
WEBHOOK_IMAGE_NAME ?= bitfusion-webhook
PKG_IMAGE_NAME ?= bitfusion-client
IMAGE_TAG  ?= 0.1
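
The variables are assigned with ?=, which in GNU make sets a value only when the variable is not already defined, so they can be overridden from the environment or the make command line without editing the Makefile. The sketch below shows how the image references appear to be composed from these variables, judging by the docker images listing in this guide. The helper image_ref is hypothetical, and registry.example.com/myteam is a placeholder address.

```shell
#!/bin/sh
# Defaults copied from the Makefile; ${VAR:-default} mirrors make's ?=.
IMAGE_REPO=${IMAGE_REPO:-docker.io/bitfusiondeviceplugin}
IMAGE_TAG=${IMAGE_TAG:-0.1}

# image_ref NAME: print REPO/NAME:TAG (hypothetical helper).
image_ref() {
    printf '%s/%s:%s\n' "$IMAGE_REPO" "$1" "$IMAGE_TAG"
}

# Example usage (commented out so the file can be sourced safely):
# image_ref bitfusion-device-plugin
# make build-image IMAGE_REPO=registry.example.com/myteam
```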

Now build the container images with the following command:

$ make build-image

Check the build result with the following command:

$ docker images
REPOSITORY                                                                         TAG
docker.io/bitfusiondeviceplugin/bitfusion-device-plugin                            0.1                
docker.io/bitfusiondeviceplugin/bitfusion-webhook                                  0.1                
docker.io/bitfusiondeviceplugin/bitfusion-client                                   0.1

(Optional, recommended) To push the container images to a registry, use the following command. If necessary, log in to the registry first with the "docker login" command.

$ make push-image

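Note: If no image registry is available, export the container images to files and copy them to every worker node of the Kubernetes cluster. Use docker commands to save the images as tar files, distribute the files to the Kubernetes nodes manually, and then load the images from the tar files on each node. See the docker command documentation for details.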
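
The manual distribution described in the note can be sketched as follows. docker save -o and docker load -i are standard docker commands; copying with scp, loading over ssh, the /tmp target directory, and the helper names are assumptions made for illustration. With DRY_RUN=1 the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical distribution sketch; node access via scp/ssh is an assumption.

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# distribute_image IMAGE TARFILE NODE...: save IMAGE to TARFILE,
# copy the file to /tmp on each NODE, and load it there.
distribute_image() {
    image=$1
    tarfile=$2
    shift 2
    remote="/tmp/$(basename "$tarfile")"
    run docker save -o "$tarfile" "$image"
    for node in "$@"; do
        run scp "$tarfile" "$node:$remote"
        run ssh "$node" docker load -i "$remote"
    done
}

# Example usage (commented out; worker1 and worker2 are placeholder hosts):
# DRY_RUN=1 distribute_image \
#     docker.io/bitfusiondeviceplugin/bitfusion-device-plugin:0.1 \
#     device-plugin.tar worker1 worker2
```

The same call would be repeated for the bitfusion-webhook and bitfusion-client images.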

Deploy the Bitfusion device plugin and the related components with the following command:

$ make deploy

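Verify the deployment

After installing with Option 1 or Option 2, use the following commands to check that all components started correctly. The device plugin runs in the kube-system namespace; the other components run in the bwki namespace.

Check that the device plugin is running: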

$ kubectl get pods -n kube-system

NAME                            READY   STATUS    RESTARTS   AGE
bitfusion-device-plugin-cfr87   1/1     Running   0          6m13s

Check that the webhook is running:

$ kubectl  get pod -n bwki

NAME                                            READY   STATUS    RESTARTS   AGE
bitfusion-webhook-deployment-6dbc6df664-td6t7   1/1     Running   0          7m49s

Check the status of the other deployed components:

$ kubectl get configmap -n bwki

NAME                                DATA   AGE
bwki-webhook-configmap              1      71m

$ kubectl get serviceaccount  -n bwki

NAME                           SECRETS   AGE
bitfusion-webhook-deployment   1         72m

$ kubectl get ValidatingWebhookConfiguration  -n bwki

NAME                          CREATED AT
validation.bitfusion.io-cfg   2021-03-25T05:29:17Z

$ kubectl get MutatingWebhookConfiguration   -n bwki

NAME                          CREATED AT
bwki-webhook-cfg              2021-03-25T05:29:17Z

$ kubectl get svc   -n bwki

NAME                          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
bwki-webhook-svc              ClusterIP   10.101.39.4   <none>        443/TCP   76m
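
The Pod checks above can also be scripted. The sketch below parses the tabular output of kubectl get pods in the form shown above and succeeds only if at least one Pod is listed and every listed Pod has STATUS Running. The helper name all_running is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; reads `kubectl get pods` output on standard input.

# all_running: succeed only if at least one Pod row is present and the
# STATUS column (third field) of every row is "Running".
all_running() {
    awk '
        NF == 0 || $1 == "NAME" { next }   # skip blank lines and the header
        { rows++ }
        $3 != "Running" { bad++ }
        END { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Example usage (commented out; requires access to the cluster):
# kubectl get pods -n bwki | all_running && echo "webhook is running"
```

kubectl also has a built-in alternative, kubectl wait --for=condition=Ready pod --all -n bwki --timeout=120s, which blocks until the Pods are ready or the timeout expires.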

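Use Bitfusion resources in Kubernetes workloads
After installation, you can write Kubernetes YAML files that use Bitfusion resources. Three parameters in the YAML file relate to Bitfusion resources:

auto-management/bitfusion: yes / no
# Whether to use the Bitfusion device plugin
bitfusion.io/gpu-num:
# Number of GPUs to request from the Bitfusion cluster
bitfusion.io/gpu-percent:
# Percentage of each requested GPU

Below is a sample Pod YAML that runs the TensorFlow benchmarks. The hostPath volume points to the TensorFlow Benchmarks directory on the host, which must be mounted into the Pod.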

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # "yes" enables the Bitfusion device plugin for this Pod
    auto-management/bitfusion: "yes"
  name: bf-pkgs
  # Set this to the namespace of your choice
  namespace: tensorflow-benchmark
spec:
  containers:
    - image: nvcr.io/nvidia/tensorflow:19.07-py3
      imagePullPolicy: IfNotPresent
      name: bf-pkgs
      command: ["python /benchmark/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --batch_size=32 --model=inception3"]
      resources:
        limits:
          # Request one GPU from the Bitfusion cluster for this Pod
          bitfusion.io/gpu-num: 1
          # Request 50% of each GPU
          bitfusion.io/gpu-percent: 50
      volumeMounts:
        - name: code
          mountPath: /benchmark
  volumes:
    - name: code
      # The benchmark code comes from: https://github.com/tensorflow/benchmarks/tree/tf_benchmark_stage
      # Make sure /home/benchmarks on the node contains that code
      hostPath:
        path: /home/benchmarks

Then deploy pod.yaml with the following commands:

$ kubectl create namespace tensorflow-benchmark
$ kubectl create -f example/pod.yaml

If the Pod runs successfully, its output looks like this:

[INFO] 2021-03-27T04:26:40Z Query server 192.168.1.100:56001 gpu availability
[INFO] 2021-03-27T04:26:41Z Choosing GPUs from server list [192.168.1.100:56001]
[INFO] 2021-03-27T04:26:41Z Requesting GPUs [0] with 8080 MiB of memory from server 0, with version 2.5.0-fd3e4839...
Requested resources:
Server List: 192.168.1.100:56001
Client idle timeout: 0 min
[INFO] 2021-03-27T04:26:42Z Locked 1 GPUs with partial memory 0.5, configuration saved to '/tmp/bitfusion125236687'
[INFO] 2021-03-27T04:26:42Z Running client command 'python /benchmark/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --batch_size=32 --model=inception3' on 1 GPUs, with the following servers:
[INFO] 2021-03-27T04:26:42Z 192.168.1.100 55001 ab4a56d5-8df4-4c93-891d-1c5814cf83f6 56001 2.5.0-fd3e4839

2021-03-27 04:26:43.511803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1

......

Instructions for updating:
non-resource variables are not supported in the long term
2021-03-27 04:26:48.173243: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394455000 Hz
2021-03-27 04:26:48.174378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c8ad60 executing computations on platform Host. Devices:
2021-03-27 04:26:48.174426: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2021-03-27 04:26:48.184024: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-03-27 04:26:54.831820: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-27 04:26:55.195722: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c927b0 executing computations on platform CUDA. Devices:
2021-03-27 04:26:55.195825: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2021-03-27 04:26:56.476786: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-27 04:26:56.846965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:00:00.0

......

TensorFlow:  1.14
Model:       inception3
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 199.4 +/- 0.0 (jitter = 0.0)        7.312
10      images/sec: 196.6 +/- 2.1 (jitter = 5.7)        7.290
20      images/sec: 198.3 +/- 1.3 (jitter = 4.5)        7.351
30      images/sec: 198.4 +/- 0.9 (jitter = 3.8)        7.300
40      images/sec: 199.4 +/- 0.8 (jitter = 4.1)        7.250
50      images/sec: 199.8 +/- 0.7 (jitter = 4.6)        7.283
60      images/sec: 200.1 +/- 0.6 (jitter = 4.2)        7.301
70      images/sec: 199.8 +/- 0.6 (jitter = 4.2)        7.266
80      images/sec: 200.1 +/- 0.6 (jitter = 4.4)        7.286
90      images/sec: 199.9 +/- 0.5 (jitter = 4.4)        7.334
100     images/sec: 199.9 +/- 0.5 (jitter = 4.0)        7.380
----------------------------------------------------------------
total images/sec: 199.65
----------------------------------------------------------------

......
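
In the log above, bitfusion.io/gpu-percent: 50 shows up as "partial memory 0.5" together with a request for 8080 MiB. The arithmetic can be sketched as below. The total of 16160 MiB is inferred from the log (8080 MiB at 50%) rather than stated anywhere in this guide, and both helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the gpu-percent arithmetic.

# percent_to_fraction PERCENT: print PERCENT / 100.
percent_to_fraction() {
    awk -v p="$1" 'BEGIN { printf "%g\n", p / 100 }'
}

# partial_memory_mib TOTAL_MIB PERCENT: print the integer share of
# TOTAL_MIB that corresponds to PERCENT.
partial_memory_mib() {
    echo $(( $1 * $2 / 100 ))
}

# Example usage (commented out so the file can be sourced safely):
# percent_to_fraction 50          # prints 0.5
# partial_memory_mib 16160 50     # prints 8080
```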

When the job finishes, delete the Pod with the following command:

$ kubectl delete -f example/pod.yaml

Troubleshooting

If the Pod does not run successfully, check its logs for details with the following command:

$ kubectl logs -n tensorflow-benchmark   bf-pkgs

"tensorflow-benchmark" is the namespace of the Pod, and "bf-pkgs" is its name.

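If the log shows errors connecting to the Bitfusion server, check that the Bitfusion token is still valid in the vCenter Bitfusion plugin. Download a new token and update the secret in Kubernetes with the following commands (make sure to delete every old bitfusion-secret in each Kubernetes namespace):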

$ kubectl delete secret -n kube-system bitfusion-secret
$ kubectl delete secret -n tensorflow-benchmark  bitfusion-secret
$ kubectl create secret generic bitfusion-secret --from-file=tokens -n kube-system
$ make deploy
