Merge pull request #4310 from spidernet-io/pr/welan/docai
optimize ai doc
weizhoublue authored Nov 25, 2024
2 parents 02de076 + ae8d441 commit 747baa6
Showing 4 changed files with 259 additions and 115 deletions.
96 changes: 66 additions & 30 deletions docs/usage/install/ai/get-started-macvlan-zh_CN.md
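@@ -252,7 +252,7 @@
In the following example, the annotation `v1.multus-cni.io/default-network` selects Calico's default network interface for control-plane communication, while the annotation `k8s.v1.cni.cncf.io/networks` attaches 8 GPU-affinity network interfaces for RDMA communication and configures 8 types of RDMA resources.
> Note: RDMA network resources can be injected into applications automatically, see [Automatically Inject RDMA Resources Based on Webhook](#基于-webhook-自动注入网络资源)
> Note: RDMA network resources can be injected into applications automatically, see [Automatically Inject RDMA Resources Based on Webhook](#基于-webhook-自动注入-rdma-网络资源)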
```shell
$ helm repo add spiderchart https://spidernet-io.github.io/charts
@@ -420,7 +420,7 @@
$ ib_read_lat 172.91.0.115
```
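## Automatically Inject Network Resources Based on Webhook
## Automatically Inject RDMA Network Resources Based on Webhook
In the steps above, we showed how to use SR-IOV technology to provide RDMA communication for containers in RoCE and Infiniband network environments. However, configuring AI applications with multiple network interfaces becomes complicated. To simplify this, Spiderpool supports grouping a set of network interface configurations through the annotation `cni.spidernet.io/rdma-resource-inject`. Users only need to add the same annotation to the application as to the network interface configurations, and Spiderpool will use a webhook to automatically inject all matching network interfaces and network resources carrying that annotation into the application.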
@@ -432,47 +432,83 @@
~# helm upgrade --install spiderpool spiderpool/spiderpool --namespace spiderpool --create-namespace --reuse-values --set spiderpoolController.podResourceInject.enabled=true
```
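> After enabling webhook-based automatic injection of network resources, you can change its settings by updating the podResourceInject field in configMap: spiderpool-config.
>
> Use `podResourceInject.namespacesExclude` to specify namespaces that are excluded from RDMA network resource injection, and `podResourceInject.namespacesInclude` to specify namespaces that require it.
>
> Currently, after changing the configuration, you must restart spiderpool-controller for the change to take effect.
> After enabling webhook-based automatic injection of network resources, you can change its settings by updating the podResourceInject field in configMap: spiderpool-config.
>
> Use `podResourceInject.namespacesExclude` to specify namespaces that are excluded from RDMA network resource injection.
>
> Use `podResourceInject.namespacesInclude` to specify namespaces that require RDMA network resource injection. If neither `podResourceInject.namespacesExclude` nor `podResourceInject.namespacesInclude` is specified, RDMA network resources are injected in all namespaces by default.
>
> Currently, after changing the configuration, you must restart spiderpool-controller for the change to take effect.
The following example excludes the namespaces `["kube-system", "spiderpool"]` from RDMA network resource injection and enables injection for the namespace `["test"]`.
2. When creating all SpiderMultusConfig instances for the AI compute network, add an annotation with the key "cni.spidernet.io/rdma-resource-inject"; the value can be any custom string.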
```yaml
apiVersion: v1
data:
  conf.yml: |
    enableIPv4: true
    ...
    podResourceInject:
      enabled: true
      namespacesExclude: ["kube-system", "spiderpool"]
      namespacesInclude: ["test"]
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderIPPool
metadata:
  name: gpu1-net11
spec:
  gateway: 172.16.11.254
  subnet: 172.16.11.0/16
  ips:
    - 172.16.11.1-172.16.11.200
---
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: gpu1-sriov
  namespace: spiderpool
  labels:
    cni.spidernet.io/rdma-resource-inject: gpu-network
spec:
  cniType: macvlan
  macvlan:
    master: ["enp11s0f0np0"]
    enableRdma: true
    rdmaResourceName: spidernet.io/gpu1rdma
    ippools:
      ipv4: ["gpu1-net11"]
```
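2. If the AI application needs multiple network interfaces, add the following annotation to each SpiderMultusConfig resource that should be attached to the application. For how to create SpiderMultusConfig resources, see [Network Interface Resource Creation](#create-spiderpool-resource):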
```shell
~# kubectl annotate SpiderMultusConfig -n [namespace] [resource name] "cni.spidernet.io/rdma-resource-inject=gpu-macvlan"
# Example:
~# kubectl annotate SpiderMultusConfig -n spiderpool gpu1-macvlan "cni.spidernet.io/rdma-resource-inject=gpu-macvlan"
3. When creating the AI application, add the same annotation to it:
```yaml
...
spec:
  template:
    metadata:
      annotations:
        cni.spidernet.io/rdma-resource-inject: gpu-network
```
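> In `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`, the key `cni.spidernet.io/rdma-resource-inject` is fixed and must not be changed, while the value `gpu-macvlan` can be customized by the user.
> Note: When using webhook-based automatic injection of network resources, do not add other network configuration annotations (such as `k8s.v1.cni.cncf.io/networks` and `ipam.spidernet.io/ippools`) to the application; otherwise automatic resource injection will be affected.
3. When creating the AI application, add the same annotation `cni.spidernet.io/rdma-resource-inject: gpu-macvlan` to it, so that Spiderpool automatically attaches multiple GPU-affinity network interfaces to each Pod of the application for RDMA communication and configures multiple RDMA resources:
4. After the Pod is created, you can observe that network interface annotations and RDMA resources have been injected into it automatically.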
```yaml
...
spec:
  template:
    metadata:
      annotations:
        cni.spidernet.io/rdma-resource-inject: gpu-macvlan
        k8s.v1.cni.cncf.io/networks: |-
          [{"name":"gpu1-sriov","namespace":"spiderpool"},
           {"name":"gpu2-sriov","namespace":"spiderpool"},
           {"name":"gpu3-sriov","namespace":"spiderpool"},
           {"name":"gpu4-sriov","namespace":"spiderpool"},
           {"name":"gpu5-sriov","namespace":"spiderpool"},
           {"name":"gpu6-sriov","namespace":"spiderpool"},
           {"name":"gpu7-sriov","namespace":"spiderpool"},
           {"name":"gpu8-sriov","namespace":"spiderpool"}]
    ....
          resources:
            limits:
              spidernet.io/gpu1rdma: 1
              spidernet.io/gpu2rdma: 1
              spidernet.io/gpu3rdma: 1
              spidernet.io/gpu4rdma: 1
              spidernet.io/gpu5rdma: 1
              spidernet.io/gpu6rdma: 1
              spidernet.io/gpu7rdma: 1
              spidernet.io/gpu8rdma: 1
```
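> Note: When using webhook-based automatic injection of network resources, do not add other network configuration annotations (such as `k8s.v1.cni.cncf.io/networks` and `ipam.spidernet.io/ippools`) to the application; otherwise automatic resource injection will be affected.
Once the Pod is Running, enter the Pod's network namespace to check whether all RDMA resources carrying the same annotation were added to the Pod. See [Pod Network Resource Check](#checking-pod-network).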
101 changes: 69 additions & 32 deletions docs/usage/install/ai/get-started-macvlan.md
@@ -252,7 +252,7 @@ The network planning for the cluster is as follows:
In the following example, the annotation field `v1.multus-cni.io/default-network` specifies the use of the default Calico network card for control plane communication. The annotation field `k8s.v1.cni.cncf.io/networks` connects to the 8 network cards affinitized to the GPU for RDMA communication, and configures 8 types of RDMA resources.
> NOTICE: RDMA resources can be injected into the application automatically, see [Auto inject RDMA Resources](#auto-inject-rdma-resources-base-on-webhook)
> NOTICE: RDMA resources can be injected into the application automatically, see [Auto inject RDMA Resources](#auto-inject-rdma-resources-based-on-webhook)
```shell
$ helm repo add spiderchart https://spidernet-io.github.io/charts
@@ -417,59 +417,96 @@ The network planning for the cluster is as follows:
$ ib_read_lat 172.91.0.115
```
## Auto Inject RDMA Resources base on webhook
In the above steps, we demonstrated how to use SR-IOV technology to provide RDMA communication capabilities for containers in RoCE and Infiniband network environments. However, when configuring AI applications with multiple network cards, the process becomes complicated. To simplify this process, Spiderpool supports classification of a set of network card configurations through annotations (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the application, and Spiderpool will automatically inject all corresponding network cards and network resources with the same annotation into the application through webhook.
## Auto Inject RDMA Resources Based on Webhook
> This feature only supports network card configurations with cniType of [ macvlan,ipvlan,sriov,ib-sriov, ipoib ].
In the steps above, we demonstrated how to use SR-IOV technology to provide RDMA communication capabilities for containers in RoCE and Infiniband network environments. However, the process can become complex when configuring AI applications with multiple network cards. To simplify this process, Spiderpool supports classifying a set of network card configurations through annotations (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the application, and Spiderpool will automatically inject all corresponding network cards and network resources with the same annotation into the application through a webhook.
1. Currently Spiderpool's webhook automatically injects RDMA network resources, which is disabled by default and needs to be enabled manually.
> This feature only supports network card configurations with cniType of [macvlan, ipvlan, sriov, ib-sriov, ipoib].
1. Currently, Spiderpool's webhook for automatically injecting RDMA network resources is disabled by default and needs to be enabled manually.
```shell
~# helm upgrade --install spiderpool spiderpool/spiderpool --namespace spiderpool --create-namespace --reuse-values --set spiderpoolController.podResourceInject.enabled=true
```
> After enabling the webhook automatic injection of network resources, you can update the configuration by updating the podResourceInject field in configMap: spiderpool-config.
>
> You can specify namespaces that do not require RDMA network resource injection through `podResourceInject.namespacesExclude`, and specify namespaces that require RDMA network resource injection through `podResourceInject.namespacesInclude`.
>
> Currently, after completing the configuration change, you need to restart spiderpool-controller for the configuration to take effect.
> After enabling the webhook automatic injection of network resources, you can update the configuration by updating the podResourceInject field in configMap: spiderpool-config.
>
> Specify namespaces that do not require RDMA network resource injection through `podResourceInject.namespacesExclude`.
>
> Specify namespaces that require RDMA network resource injection through `podResourceInject.namespacesInclude`. If neither `podResourceInject.namespacesExclude` nor `podResourceInject.namespacesInclude` is specified, RDMA network resource injection is performed for all namespaces by default.
>
> Currently, after completing the configuration change, you need to restart the spiderpool-controller for the configuration to take effect.
The following example shows the namespaces `["kube-system", "spiderpool"]` that do not need RDMA network resource injection, and the namespace `["test"]` that needs injection.
2. When creating all SpiderMultusConfig instances for AI computing networks, add an annotation with the key "cni.spidernet.io/rdma-resource-inject" and a customizable value.
```yaml
apiVersion: v1
data:
  conf.yml: |
    enableIPv4: true
    ...
    podResourceInject:
      enabled: true
      namespacesExclude: ["kube-system", "spiderpool"]
      namespacesInclude: ["test"]
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderIPPool
metadata:
  name: gpu1-net11
spec:
  gateway: 172.16.11.254
  subnet: 172.16.11.0/16
  ips:
    - 172.16.11.1-172.16.11.200
---
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: gpu1-sriov
  namespace: spiderpool
  labels:
    cni.spidernet.io/rdma-resource-inject: gpu-network
spec:
  cniType: macvlan
  macvlan:
    master: ["enp11s0f0np0"]
    enableRdma: true
    rdmaResourceName: spidernet.io/gpu1rdma
    ippools:
      ipv4: ["gpu1-net11"]
```
2. If your AI application requires multiple network cards, please add the following annotations to the SpiderMultusConfig resources for the multiple network cards that need to be configured for the application. For how to create SpiderMultusConfig resources, please refer to [Network Card Resource Creation](#create-spiderpool-resource):
3. When creating an AI application, add the same annotation to the application:
```bash
~# kubectl annotate SpiderMultusConfig -n [namespace] [resource name] "cni.spidernet.io/rdma-resource-inject=gpu-macvlan"
# The example is as follows:
~# kubectl annotate SpiderMultusConfig -n spiderpool gpu1-macvlan "cni.spidernet.io/rdma-resource-inject=gpu-macvlan"
```yaml
...
spec:
  template:
    metadata:
      annotations:
        cni.spidernet.io/rdma-resource-inject: gpu-network
```
> In `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`, the key `cni.spidernet.io/rdma-resource-inject` is fixed and must not be changed, while the value `gpu-macvlan` can be customized by the user.
> Note: When using the webhook automatic injection of network resources feature, do not add other network configuration annotations (such as `k8s.v1.cni.cncf.io/networks` and `ipam.spidernet.io/ippools`) to the application, as it will affect the automatic injection of resources.
3. When creating an AI application, add the same annotation to the application: `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`, so that Spiderpool automatically adds multiple GPU-affinity network cards for each Pod of the application for RDMA communication, and configures multiple RDMA resources:
4. Once the Pod is created, you can observe that the Pod has been automatically injected with network card annotations and RDMA resources.
```yaml
...
spec:
  template:
    metadata:
      annotations:
        cni.spidernet.io/rdma-resource-inject: gpu-macvlan
        k8s.v1.cni.cncf.io/networks: |-
          [{"name":"gpu1-sriov","namespace":"spiderpool"},
           {"name":"gpu2-sriov","namespace":"spiderpool"},
           {"name":"gpu3-sriov","namespace":"spiderpool"},
           {"name":"gpu4-sriov","namespace":"spiderpool"},
           {"name":"gpu5-sriov","namespace":"spiderpool"},
           {"name":"gpu6-sriov","namespace":"spiderpool"},
           {"name":"gpu7-sriov","namespace":"spiderpool"},
           {"name":"gpu8-sriov","namespace":"spiderpool"}]
    ....
          resources:
            limits:
              spidernet.io/gpu1rdma: 1
              spidernet.io/gpu2rdma: 1
              spidernet.io/gpu3rdma: 1
              spidernet.io/gpu4rdma: 1
              spidernet.io/gpu5rdma: 1
              spidernet.io/gpu6rdma: 1
              spidernet.io/gpu7rdma: 1
              spidernet.io/gpu8rdma: 1
```
> Note: When using the webhook automatic injection of network resources feature, do not add other network configuration annotations (such as `k8s.v1.cni.cncf.io/networks` and `ipam.spidernet.io/ippools`) to the Pod, otherwise it will affect the automatic injection of resources.
When the Pod is successfully Running, check whether all RDMA resources with the same annotations have been successfully added to the Pod by entering the Pod network namespace. Refer to [Pod Network Resource Checking](#checking-pod-network)
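
As a rough illustration of the workflow described in this section, the following is a minimal sketch of a complete workload manifest that relies on the webhook. The Deployment name, labels, image, and command are placeholders invented for this example and do not come from the commit. Only the annotation key `cni.spidernet.io/rdma-resource-inject` and the value `gpu-network` are taken from the text, and the sketch assumes the SpiderMultusConfig resources were tagged with that same value.

```yaml
# Hypothetical workload: name, labels, image, and command are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rdma-inject-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rdma-inject-demo
  template:
    metadata:
      labels:
        app: rdma-inject-demo
      annotations:
        # Must match the value set on the SpiderMultusConfig resources.
        cni.spidernet.io/rdma-resource-inject: gpu-network
        # k8s.v1.cni.cncf.io/networks and ipam.spidernet.io/ippools are
        # deliberately absent: the note above says they affect injection.
    spec:
      containers:
        - name: demo
          image: example.com/rdma-tools:latest  # placeholder image
          command: ["sleep", "infinity"]
          # No RDMA resource limits are declared here; per the text, the
          # webhook adds them when the Pod is created.
```

If injection works as described, the created Pod should then carry the `k8s.v1.cni.cncf.io/networks` annotation and the RDMA resource limits shown in step 4, without either being written by hand.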