podRdmaResourceInject: use annotation without label to a same group of spiderMultusConfig

Signed-off-by: cyclinder <[email protected]>
cyclinder committed Nov 4, 2024
1 parent 646ed6d commit 7fdd696
Showing 12 changed files with 400 additions and 142 deletions.
6 changes: 3 additions & 3 deletions docs/usage/install/ai/get-started-macvlan-zh_CN.md
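@@ -415,7 +415,7 @@
 ## Automatic Network Resource Injection via Webhook
-To reduce the complexity of configuring multiple NICs for AI applications, Spiderpool supports grouping a set of NIC configurations through labels (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the Pod, and Spiderpool then uses a webhook to inject all NICs and network resources that carry the same label into the Pod.
+To reduce the complexity of configuring multiple NICs for AI applications, Spiderpool supports grouping a set of NIC configurations through annotations (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the Pod, and Spiderpool then uses a webhook to inject all NICs and network resources that carry the same annotation into the Pod.
 > This feature supports only NIC configurations whose cniType is one of [ macvlan, ipvlan, sriov, ib-sriov, ipoib ].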
@@ -438,7 +438,7 @@ Spiderpool 为了简化 AI 应用配置多网卡的复杂度,支持通过 labe
 metadata:
   name: gpu1-macvlan
   namespace: spiderpool
-  labels:
+  annotations:
     cni.spidernet.io/rdma-resource-inject: gpu-macvlan
 spec:
   cniType: macvlan
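@@ -451,7 +451,7 @@ Spiderpool 为了简化 AI 应用配置多网卡的复杂度,支持通过 labe
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`: the key is fixed and the value is user-defined. A group of NIC configurations with the same `Label` and `Value` must have the same `cniType`.
+> - `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`: the key is fixed and the value is user-defined.
 > - `enableRdma`, `rdmaResourceName`, and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 3. When creating the application, add the annotation `cni.spidernet.io/rdma-resource-inject: gpu-macvlan` so that Spiderpool automatically adds 8 GPU-affinity NICs to the Pod for RDMA communication and configures 8 types of RDMA resources: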
6 changes: 3 additions & 3 deletions docs/usage/install/ai/get-started-macvlan.md
@@ -413,7 +413,7 @@ The network planning for the cluster is as follows:
 ## Auto Inject RDMA Resources Based on Webhook
-To simplify the complexity of configuring multiple network cards for AI applications, Spiderpool supports categorizing a group of network card configurations through labels (cni.spidernet.io/rdma-resource-inject). Users only need to add the same annotation to the Pod. This way, Spiderpool will automatically inject all corresponding network cards and network resources with the same label into the Pod through a webhook.
+To simplify the complexity of configuring multiple network cards for AI applications, Spiderpool supports categorizing a group of network card configurations through annotations (cni.spidernet.io/rdma-resource-inject). Users only need to add the same annotation to the Pod. This way, Spiderpool will automatically inject all corresponding network cards and network resources with the same annotation into the Pod through a webhook.
 > This feature only supports network card configurations with cniType of [ macvlan, ipvlan, sriov, ib-sriov, ipoib ].
@@ -436,7 +436,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 metadata:
   name: gpu1-macvlan
   namespace: spiderpool
-  labels:
+  annotations:
     cni.spidernet.io/rdma-resource-inject: gpu-macvlan
 spec:
   cniType: macvlan
@@ -449,7 +449,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-macvlan` is a fixed key, and the value is user-defined. A group of network card configurations with the same Label and Value must have the same cniType.
+> - In `cni.spidernet.io/rdma-resource-inject: gpu-macvlan`, the key is fixed and the value is user-defined.
 > - `enableRdma`, `rdmaResourceName` and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 3. Add the annotation `cni.spidernet.io/rdma-resource-inject: gpu-macvlan` to the Pod, so that Spiderpool automatically adds 8 GPU-affinity network cards for RDMA communication and configures 8 types of RDMA resources:
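The grouping described in this doc hunk can be pictured with a small Go sketch. It is not Spiderpool's webhook code: the `multusConfig` type, the `selectNetworks` helper and the resource names are hypothetical, and the assumption is that the webhook matches the Pod annotation value against the label that the mutating step in this commit copies onto each SpiderMultusConfig. The result is formatted as the comma-separated `namespace/name` list that Multus accepts in its `k8s.v1.cni.cncf.io/networks` annotation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// injectKey is the annotation/label key shown in the docs above.
const injectKey = "cni.spidernet.io/rdma-resource-inject"

// multusConfig is a hypothetical, trimmed-down view of a SpiderMultusConfig.
type multusConfig struct {
	Namespace        string
	Name             string
	Labels           map[string]string
	RdmaResourceName string
}

// selectNetworks returns the network references and RDMA resource names of
// every config whose label value equals the Pod's annotation value.
// Both results are sorted so the output is deterministic.
func selectNetworks(podAnnotations map[string]string, configs []multusConfig) (string, []string) {
	group, ok := podAnnotations[injectKey]
	if !ok {
		return "", nil // the Pod did not ask for injection
	}
	var refs, resources []string
	for _, c := range configs {
		if v, found := c.Labels[injectKey]; !found || v != group {
			continue
		}
		refs = append(refs, c.Namespace+"/"+c.Name)
		resources = append(resources, c.RdmaResourceName)
	}
	sort.Strings(refs)
	sort.Strings(resources)
	return strings.Join(refs, ","), resources
}

func main() {
	configs := []multusConfig{
		{Namespace: "spiderpool", Name: "gpu2-macvlan", Labels: map[string]string{injectKey: "gpu-macvlan"}, RdmaResourceName: "res2"},
		{Namespace: "spiderpool", Name: "gpu1-macvlan", Labels: map[string]string{injectKey: "gpu-macvlan"}, RdmaResourceName: "res1"},
		{Namespace: "spiderpool", Name: "gpu1-sriov", Labels: map[string]string{injectKey: "gpu-sriov"}, RdmaResourceName: "res3"},
	}
	pod := map[string]string{injectKey: "gpu-macvlan"}
	networks, resources := selectNetworks(pod, configs)
	fmt.Println(networks)
	fmt.Println(resources)
}
```

Only the two `gpu-macvlan` configs are selected here; the `gpu-sriov` one belongs to a different group and is skipped.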
60 changes: 30 additions & 30 deletions docs/usage/install/ai/get-started-sriov-zh_CN.md
@@ -211,39 +211,39 @@ Spiderpool 使用了 [sriov-network-operator](https://github.com/k8snetworkplumb
   name: gpu1-nic-policy
   namespace: spiderpool
 spec:
   nodeSelector:
     kubernetes.io/os: "linux"
   resourceName: gpu1sriov
   priority: 99
   numVfs: 12
   nicSelector:
     deviceID: "1017"
     vendor: "15b3"
     rootDevices:
     - 0000:86:00.0
   linkType: ${LINK_TYPE}
   deviceType: netdevice
   isRdma: true
 ---
 apiVersion: sriovnetwork.openshift.io/v1
 kind: SriovNetworkNodePolicy
 metadata:
   name: gpu2-nic-policy
   namespace: spiderpool
 spec:
   nodeSelector:
     kubernetes.io/os: "linux"
   resourceName: gpu2sriov
   priority: 99
   numVfs: 12
   nicSelector:
     deviceID: "1017"
     vendor: "15b3"
     rootDevices:
     - 0000:86:00.0
   linkType: ${LINK_TYPE}
   deviceType: netdevice
   isRdma: true
EOF
```
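@@ -604,7 +604,7 @@ Spiderpool 使用了 [sriov-network-operator](https://github.com/k8snetworkplumb
 ## Automatic RDMA Network Resource Injection via Webhook
-To reduce the complexity of configuring multiple NICs for AI applications, Spiderpool supports grouping a set of NIC configurations through labels (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the Pod, and Spiderpool then uses a webhook to inject all NICs and network resources that carry the same label into the Pod.
+To reduce the complexity of configuring multiple NICs for AI applications, Spiderpool supports grouping a set of NIC configurations through annotations (`cni.spidernet.io/rdma-resource-inject`). Users only need to add the same annotation to the Pod, and Spiderpool then uses a webhook to inject all NICs and network resources that carry the same annotation into the Pod.
 > This feature supports only NIC configurations whose cniType is one of [ macvlan, ipvlan, sriov, ib-sriov, ipoib ].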
@@ -639,7 +639,7 @@ Spiderpool 为了简化 AI 应用配置多网卡的复杂度,支持通过 labe
 metadata:
   name: gpu1-sriov
   namespace: spiderpool
-  labels:
+  annotations:
     cni.spidernet.io/rdma-resource-inject: gpu-ibsriov
 spec:
   cniType: ib-sriov
@@ -651,7 +651,7 @@ Spiderpool 为了简化 AI 应用配置多网卡的复杂度,支持通过 labe
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-ibsriov`: the key is fixed and the value is user-defined. A group of NIC configurations with the same Label and Value must have the same cniType.
+> - `cni.spidernet.io/rdma-resource-inject: gpu-ibsriov`: the key is fixed and the value is user-defined.
 > - `resourceName` and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 (2) For Ethernet networks, configure [SR-IOV CNI](https://github.com/k8snetworkplumbingwg/sriov-cni) for all GPU-affinity SR-IOV NICs and create the corresponding IP pools. The following example configures the NIC and IP pool for GPU1:
@@ -685,7 +685,7 @@ Spiderpool 为了简化 AI 应用配置多网卡的复杂度,支持通过 labe
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-sriov`: the key is fixed and the value is user-defined. A group of NIC configurations with the same Label and Value must have the same cniType.
+> - `cni.spidernet.io/rdma-resource-inject: gpu-sriov`: the key is fixed and the value is user-defined.
 > - `resourceName` and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 3. When creating the application, add the annotation `cni.spidernet.io/rdma-resource-inject: gpu-sriov` so that Spiderpool automatically adds 8 GPU-affinity NICs to the Pod for RDMA communication and configures 8 types of RDMA resources:
10 changes: 5 additions & 5 deletions docs/usage/install/ai/get-started-sriov.md
@@ -606,7 +606,7 @@ For clusters using Infiniband networks, if there is a [UFM management platform](
 ## Auto Inject RDMA Resources Based on Webhook
-To simplify the complexity of configuring multiple network cards for AI applications, Spiderpool supports categorizing a group of network card configurations through labels (cni.spidernet.io/rdma-resource-inject). Users only need to add the same annotation to the Pod. This way, Spiderpool will automatically inject all corresponding network cards and network resources with the same label into the Pod through a webhook.
+To simplify the complexity of configuring multiple network cards for AI applications, Spiderpool supports categorizing a group of network card configurations through annotations (cni.spidernet.io/rdma-resource-inject). Users only need to add the same annotation to the Pod. This way, Spiderpool will automatically inject all corresponding network cards and network resources with the same annotation into the Pod through a webhook.
 > This feature only supports network card configurations with cniType of [ macvlan, ipvlan, sriov, ib-sriov, ipoib ].
@@ -641,7 +641,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 metadata:
   name: gpu1-sriov
   namespace: spiderpool
-  labels:
+  annotations:
     cni.spidernet.io/rdma-resource-inject: gpu-ibsriov
 spec:
   cniType: ib-sriov
@@ -652,7 +652,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-ibsriov` is a fixed key, and the value is user-defined. A group of network card configurations with the same `Label` and `Value` must have the same `cniType`.
+> - In `cni.spidernet.io/rdma-resource-inject: gpu-ibsriov`, the key is fixed and the value is user-defined.
 > - `resourceName` and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 b. For Ethernet networks, configure [the SR-IOV CNI](https://github.com/k8snetworkplumbingwg/sriov-cni) for all GPU-affinity SR-IOV network cards and create the corresponding IP address pool. The following example configures the network card and IP address pool for GPU1:
@@ -674,7 +674,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 metadata:
   name: gpu1-sriov
   namespace: spiderpool
-  labels:
+  annotations:
     cni.spidernet.io/rdma-resource-inject: gpu-sriov
 spec:
   cniType: sriov
@@ -686,7 +686,7 @@ To simplify the complexity of configuring multiple network cards for AI applicat
 EOF
 ```
-> - `cni.spidernet.io/rdma-resource-inject: gpu-sriov` is a fixed key, and the value is user-defined. A group of network card configurations with the same `Label` and `Value` must have the same `cniType`.
+> - In `cni.spidernet.io/rdma-resource-inject: gpu-sriov`, the key is fixed and the value is user-defined.
 > - `resourceName` and `ippools` must be configured; otherwise network resources cannot be injected into the Pod.
 3. Add the annotation `cni.spidernet.io/rdma-resource-inject: gpu-sriov` to the Pod, so that Spiderpool automatically adds 8 GPU-affinity network cards for RDMA communication and configures 8 types of RDMA resources:
11 changes: 11 additions & 0 deletions pkg/multuscniconfig/multusconfig_mutate.go
@@ -51,6 +51,17 @@ func mutateSpiderMultusConfig(ctx context.Context, smc *spiderpoolv2beta1.Spider
 	if smc.Spec.ChainCNIJsonData == nil {
 		smc.Spec.ChainCNIJsonData = []string{}
 	}
+
+	// inject the labels
+	value, ok := smc.Annotations[constant.AnnoPodResourceInject]
+	if !ok {
+		return
+	}
+
+	if smc.Labels == nil {
+		smc.Labels = make(map[string]string)
+	}
+	smc.Labels[constant.AnnoPodResourceInject] = value
 }
 
 func setMacvlanDefaultConfig(macvlanConfig *spiderpoolv2beta1.SpiderMacvlanCniConfig) {
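The mutation added in this hunk copies the inject annotation into the object's labels. A minimal standalone sketch of that step follows; `objectMeta` and `mirrorInjectAnnotation` are hypothetical stand-ins for the real SpiderMultusConfig types, and the key string is taken from the docs changed in this commit rather than from `constant.AnnoPodResourceInject` itself.

```go
package main

import "fmt"

// annoPodResourceInject stands in for constant.AnnoPodResourceInject; the
// string value is the key shown in the docs of this commit.
const annoPodResourceInject = "cni.spidernet.io/rdma-resource-inject"

// objectMeta is a hypothetical, trimmed-down stand-in for the metadata of a
// SpiderMultusConfig.
type objectMeta struct {
	Annotations map[string]string
	Labels      map[string]string
}

// mirrorInjectAnnotation copies the inject annotation into the labels and
// reports whether anything was copied. Reading from a nil Annotations map is
// safe in Go, so only the Labels map needs to be allocated before writing.
func mirrorInjectAnnotation(meta *objectMeta) bool {
	value, ok := meta.Annotations[annoPodResourceInject]
	if !ok {
		return false
	}
	if meta.Labels == nil {
		meta.Labels = make(map[string]string)
	}
	meta.Labels[annoPodResourceInject] = value
	return true
}

func main() {
	meta := &objectMeta{
		Annotations: map[string]string{annoPodResourceInject: "gpu-macvlan"},
	}
	copied := mirrorInjectAnnotation(meta)
	fmt.Println(copied, meta.Labels[annoPodResourceInject])
}
```

One consequence worth keeping in mind: Kubernetes label values are limited to 63 characters and a restricted character set, so a group name that is valid as an annotation value is not necessarily valid as a label value.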
37 changes: 37 additions & 0 deletions pkg/multuscniconfig/multusconfig_validate.go
@@ -88,6 +88,7 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 		return field.Invalid(cniTypeField, nil, "CniType must not be nil")
 	}
 
+	_, injectRdmaResource := multusConfig.Annotations[constant.AnnoPodResourceInject]
 	switch *multusConfig.Spec.CniType {
 	case constant.MacvlanCNI:
 		if multusConfig.Spec.MacvlanConfig == nil {
@@ -108,6 +109,12 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, macvlanConfigField.String()))
 		}
 
+		if injectRdmaResource {
+			if err := ValidateRdmaResouce(multusConfig.Spec.MacvlanConfig.EnableRdma, multusConfig.Name, multusConfig.Namespace, multusConfig.Spec.MacvlanConfig.RdmaResourceName, multusConfig.Spec.MacvlanConfig.SpiderpoolConfigPools); err != nil {
+				return field.Invalid(macvlanConfigField, *multusConfig.Spec.MacvlanConfig, err.Error())
+			}
+		}
+
 	case constant.IPVlanCNI:
 		if multusConfig.Spec.IPVlanConfig == nil {
 			return field.Required(ipvlanConfigField, fmt.Sprintf("no %s specified", ipvlanConfigField.String()))
@@ -127,6 +134,12 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, ipvlanConfigField.String()))
 		}
 
+		if injectRdmaResource {
+			if err := ValidateRdmaResouce(multusConfig.Spec.IPVlanConfig.EnableRdma, multusConfig.Name, multusConfig.Namespace, multusConfig.Spec.IPVlanConfig.RdmaResourceName, multusConfig.Spec.IPVlanConfig.SpiderpoolConfigPools); err != nil {
+				return field.Invalid(ipvlanConfigField, *multusConfig.Spec.IPVlanConfig, err.Error())
+			}
+		}
+
 	case constant.SriovCNI:
 		if multusConfig.Spec.SriovConfig == nil {
 			return field.Required(sriovConfigField, fmt.Sprintf("no %s specified", sriovConfigField.String()))
@@ -152,6 +165,12 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, sriovConfigField.String()))
 		}
 
+		if injectRdmaResource {
+			if err := ValidateRdmaResouce(multusConfig.Spec.SriovConfig.EnableRdma, multusConfig.Name, multusConfig.Namespace, multusConfig.Spec.SriovConfig.ResourceName, multusConfig.Spec.SriovConfig.SpiderpoolConfigPools); err != nil {
+				return field.Invalid(sriovConfigField, *multusConfig.Spec.SriovConfig, err.Error())
+			}
+		}
+
 	case constant.IBSriovCNI:
 		if multusConfig.Spec.IbSriovConfig == nil {
 			return field.Required(ibsriovConfigField, fmt.Sprintf("no %s specified", ibsriovConfigField.String()))
@@ -165,6 +184,12 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, sriovConfigField.String()))
 		}
 
+		if injectRdmaResource {
+			if err := ValidateRdmaResouce(true, multusConfig.Name, multusConfig.Namespace, multusConfig.Spec.IbSriovConfig.ResourceName, multusConfig.Spec.IbSriovConfig.SpiderpoolConfigPools); err != nil {
+				return field.Invalid(ibsriovConfigField, *multusConfig.Spec.IbSriovConfig, err.Error())
+			}
+		}
+
 	case constant.IPoIBCNI:
 		if multusConfig.Spec.IpoibConfig == nil {
 			return field.Required(ipoibConfigField, fmt.Sprintf("no %s specified", ipoibConfigField.String()))
@@ -178,7 +203,16 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, sriovConfigField.String()))
 		}
 
+		if injectRdmaResource {
+			if err := ValidateRdmaResouce(true, multusConfig.Name, multusConfig.Namespace, multusConfig.Spec.IpoibConfig.Master, multusConfig.Spec.IpoibConfig.SpiderpoolConfigPools); err != nil {
+				return field.Invalid(ipoibConfigField, *multusConfig.Spec.IpoibConfig, err.Error())
+			}
+		}
+
 	case constant.OvsCNI:
+		if injectRdmaResource {
+			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s does not support RDMA resource injected", *multusConfig.Spec.CniType))
+		}
 		if multusConfig.Spec.OvsConfig == nil {
 			return field.Required(ovsConfigField, fmt.Sprintf("no %s specified", ovsConfigField.String()))
 		}
@@ -216,6 +250,9 @@ func validateCNIConfig(multusConfig *spiderpoolv2beta1.SpiderMultusConfig) *fiel
 		}
 
 	case constant.CustomCNI:
+		if injectRdmaResource {
+			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s does not support RDMA resource injected", *multusConfig.Spec.CniType))
+		}
 		// multusConfig.Spec.CustomCNIConfig can be empty
 		if checkExistedConfig(&(multusConfig.Spec), constant.CustomCNI) {
 			return field.Forbidden(cniTypeField, fmt.Sprintf("the cniType %s only supports %s, please remove other CNI configs", *multusConfig.Spec.CniType, customCniConfigField.String()))
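Taken together, the hunks in this file gate the inject annotation by cniType: the five types named in the docs are validated further, while `constant.OvsCNI` and `constant.CustomCNI` are rejected outright. A compact way to picture that rule is the sketch below. The `checkInjectAllowed` helper and the lookup table are hypothetical; the supported type names come from the docs, and `"ovs"` and `"custom"` are assumed string values for the two rejected constants.

```go
package main

import "fmt"

// rdmaInjectSupported lists the cniType values that the docs in this commit
// name as supported for RDMA resource injection.
var rdmaInjectSupported = map[string]bool{
	"macvlan":  true,
	"ipvlan":   true,
	"sriov":    true,
	"ib-sriov": true,
	"ipoib":    true,
}

// checkInjectAllowed returns an error only when the inject annotation is
// present on a config whose cniType is not in the supported set.
func checkInjectAllowed(cniType string, hasInjectAnnotation bool) error {
	if !hasInjectAnnotation || rdmaInjectSupported[cniType] {
		return nil
	}
	return fmt.Errorf("the cniType %s does not support RDMA resource injection", cniType)
}

func main() {
	// "ovs" and "custom" are assumed values of constant.OvsCNI and
	// constant.CustomCNI; they are not shown in the diff.
	for _, cniType := range []string{"macvlan", "ib-sriov", "ovs", "custom"} {
		fmt.Println(cniType, checkInjectAllowed(cniType, true))
	}
}
```

A table like this keeps the allow-list in one place, whereas the commit repeats the check inside each `case` because each supported type also needs its own type-specific fields validated.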
21 changes: 21 additions & 0 deletions pkg/multuscniconfig/utils.go
@@ -31,6 +31,7 @@ import (
 	coordinatorcmd "github.com/spidernet-io/spiderpool/cmd/coordinator/cmd"
 	spiderpoolcmd "github.com/spidernet-io/spiderpool/cmd/spiderpool/cmd"
 	"github.com/spidernet-io/spiderpool/pkg/constant"
+	"github.com/spidernet-io/spiderpool/pkg/k8s/apis/spiderpool.spidernet.io/v2beta1"
 	spiderpoolv2beta1 "github.com/spidernet-io/spiderpool/pkg/k8s/apis/spiderpool.spidernet.io/v2beta1"
 )
 
@@ -247,3 +248,23 @@ func ResourceName(smc *spiderpoolv2beta1.SpiderMultusConfig) string {
 	}
 	return ""
 }
+
+func ValidateRdmaResouce(enableRdma bool, name, namespace, rdmaResourceName string, ippools *v2beta1.SpiderpoolPools) error {
+	if !enableRdma {
+		return fmt.Errorf("spidermultusconfig %s/%s not enable RDMA", namespace, name)
+	}
+
+	if rdmaResourceName == "" {
+		return fmt.Errorf("rdmaResourceName can not empty for spidermultusconfig %s/%s", namespace, name)
+	}
+
+	if ippools == nil {
+		return fmt.Errorf("No any ippools configured for spidermultusconfig %s/%s", namespace, name)
+	}
+
+	if len(ippools.IPv4IPPool)+len(ippools.IPv6IPPool) == 0 {
+		return fmt.Errorf("No any ippools configured for spidermultusconfig %s/%s", namespace, name)
+	}
+
+	return nil
+}
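The order of checks in `ValidateRdmaResouce` determines which problem a user sees first. The sketch below mirrors that order in a standalone form so it can be run without the Spiderpool packages; `pools` is a hypothetical stand-in for `v2beta1.SpiderpoolPools`, `firstRdmaProblem` is an illustrative helper rather than part of the commit, and the pool and resource names are made up.

```go
package main

import "fmt"

// pools is a hypothetical stand-in for v2beta1.SpiderpoolPools, keeping only
// the two fields the validation reads.
type pools struct {
	IPv4IPPool []string
	IPv6IPPool []string
}

// firstRdmaProblem applies the same checks, in the same order, as
// ValidateRdmaResouce and returns a short description of the first failing
// one, or an empty string when every check passes.
func firstRdmaProblem(enableRdma bool, resourceName string, p *pools) string {
	switch {
	case !enableRdma:
		return "RDMA not enabled"
	case resourceName == "":
		return "empty rdmaResourceName"
	case p == nil || len(p.IPv4IPPool)+len(p.IPv6IPPool) == 0:
		// A nil pools pointer and an empty pool list are reported alike.
		return "no ippools configured"
	}
	return ""
}

func main() {
	withPool := &pools{IPv4IPPool: []string{"example-v4-pool"}}

	fmt.Printf("%q\n", firstRdmaProblem(false, "", nil))
	fmt.Printf("%q\n", firstRdmaProblem(true, "", withPool))
	fmt.Printf("%q\n", firstRdmaProblem(true, "example-resource", &pools{}))
	fmt.Printf("%q\n", firstRdmaProblem(true, "example-resource", withPool))
}
```

Because the checks short-circuit, a config that is wrong in several ways reports only the earliest failure, so a user may need more than one round of fixes.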