Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect IPConflicting and gatewayReachable in ipam rather than coordinator #4560

Merged
merged 1 commit into from
Jan 26, 2025

Conversation

cyclinder
Copy link
Collaborator

@cyclinder cyclinder commented Jan 15, 2025

Thanks for contributing!

Notice:

What issue(s) does this PR fix:

Fixes #4582

Special notes for your reviewer:

作为 #4474 的后续修复:

  1. 确保 IP 冲突检测先于网关检测,这防止当 IP 冲突检测失败且网关检测发送 ARP 探测报时意外更新了错误的 ARP 缓存表,导致通信失败。
  2. 移动 IP 冲突检测和网关检测的逻辑到 IPAM 中完成,在之前是由 Coordinator 完成。之前检测到 IP 冲突的时候,冲突的 IP 已经配置到了网卡上,可能会导致 IP 偶现不可通信的情况。
  3. 在 IPAM 中完成 IP 冲突检测和网关检测,这可能会增加 IPAM 调用的时间,在同时检测 IPv4 和 IPv6 的情况下,大概增加 1 s,具体时间视网络而定。特别对于 IPv6,开启 Duplicate Address Detection(DAD)的情况下,内核会检查本地链路地址是否冲突,这可能会花费一段时间。

As a follow-up fix to spidernet-io/spiderpool#4474:

  • Ensure that IP conflict detection precedes gateway detection. This prevents the ARP cache table from being incorrectly updated when IP conflict detection fails and gateway detection sends ARP probe packets, which could lead to communication failures.
  • Move the logic for IP conflict detection and gateway detection into IPAM, which was previously handled by the Coordinator. Previously, when an IP conflict was detected, the conflicting IP was already configured on the network interface card, potentially causing occasional IP communication failures.
  • Performing IP conflict detection and gateway detection within IPAM may increase the time for IPAM calls. When detecting both IPv4 and IPv6, the time may increase by approximately 1 second, depending on the network. Particularly for IPv6, when Duplicate Address Detection (DAD) is enabled, the kernel will check for conflicts with local link addresses, which may take some time.

@cyclinder cyclinder added the release/feature-new release note for new feature label Jan 15, 2025
@cyclinder cyclinder marked this pull request as draft January 15, 2025 07:54
@@ -110,6 +110,29 @@ paths:
description: Forbid to release IPs for stateful workload
schema:
$ref: "#/definitions/Error"
"/ipam/ip-detection-configs":
Copy link
Collaborator

@weizhoublue weizhoublue Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我建议 不新加接口,而是在 原来 分配 ip 的 接口中 增加 两个 字段,表示是否要 进行 2 个 ip 的检测

  1. 减少 api 调用,耦合配置在一起
  2. 未来 如果要基于 ippool crd 中配置 子网级别的检测 开关,新增 这个api,会导致 获取 该 ip或者所属子网 的 检测开关 非常难实现 , 因为 它 拆开了 ip 和 该ip 是否要检测 的 2个 强关联 信息

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我当时也纠结一阵,本意不想耦合 ipam response 的 配置,已更新

@weizhoublue
Copy link
Collaborator

应该还有一些变更
1 chart
2 crd 定义删减

@cyclinder cyclinder force-pushed the ipam/dectetion branch 2 times, most recently from 392a3f7 to 4f90cd4 Compare January 16, 2025 12:02
@cyclinder cyclinder marked this pull request as ready for review January 16, 2025 12:03
@cyclinder
Copy link
Collaborator Author

应该还有一些变更
1 chart
2 crd 定义删减

chart 已改, CRD 我想在另一个 PR 中统一变更

@cyclinder cyclinder force-pushed the ipam/dectetion branch 4 times, most recently from 4ee161a to 7230ca0 Compare January 16, 2025 15:40
@weizhoublue
Copy link
Collaborator

应该还有一些变更
1 chart
2 crd 定义删减

chart 已改, CRD 我想在另一个 PR 中统一变更

如果 和这个强相关的 修改 没有一起,可能会导致 v0.9 和 v1 很难 cherry pick
我想要 CRD 在另一个统一改后,那个 pr 是 不能 cherry pick 到 v0.9 的

@weizhoublue weizhoublue changed the title Detect IPConflicting and gatewayReachable in ipam without coordinator Detect IPConflicting and gatewayReachable in ipam rather than coordinator Jan 17, 2025
@weizhoublue
Copy link
Collaborator

这个应该有相关的 doc 的 修改

@cyclinder cyclinder force-pushed the ipam/dectetion branch 3 times, most recently from 52520d8 to 8bbd659 Compare January 21, 2025 11:17
@weizhoublue weizhoublue added release/bug and removed release/feature-new release note for new feature labels Jan 22, 2025
@cyclinder cyclinder force-pushed the ipam/dectetion branch 6 times, most recently from 4fb933f to bd2a188 Compare January 22, 2025 08:48
@@ -32,15 +32,14 @@ Let's delve into how coordinator implements these features.
| tunePodRoutes | Tune the pod's routing tables while a pod is in multi-NIC mode | bool | optional | true |
| podDefaultRouteNic | Configure the default routed NIC for the pod while a pod is in multi-NIC mode, The default value is 0, indicate that the first network interface of the pod has the default route. | string | optional | "" |
| podDefaultCniNic | The name of the pod's first NIC defaults to eth0 in kubernetes | bool | optional | eth0 |
| detectGateway | Enable gateway detection while creating pods, which prevent pod creation if the gateway is unreachable | bool | optional | false |
| detectIPConflict | Enable IP conflicting checking for pods, which prevent pod creation if the pod's ip is conflicting | bool | optional | false |
| detectGateway | DEPRECATED: Enable gateway detection while creating pods, which prevent pod creation if the gateway is unreachable | bool | optional | false |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 DEPRECATED 描述,更重要的是,在 CRD 定义的 yaml 中的 description,应该也提到

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRD 描述是有的,这个是 coordinator 的 配置文件,标记为 DEPRECATED 应该没问题,实际上字段已经被移除

@cyclinder cyclinder force-pushed the ipam/dectetion branch 3 times, most recently from cc1b766 to be0d2e2 Compare January 23, 2025 13:23
@@ -59,12 +59,13 @@ spec:
properties:
detectGateway:
default: false
description: DetectGateway to detect the gateway for the pod
description: DetectGateway to detect the gateway for the pod Deprecated,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        vlan:
                          default: 0
                          description: 'DEPRECATED: Vlan is deprecated.'

这是另一个 字段的描述,请对齐所有的格式,不要出现好几种风格

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return fmt.Errorf("%w: pod's interface %s with an conflicting ip %s, %s is located at %s",
constant.ErrIPConflict, d.iface, d.ip4.String(), d.ip4.String(), senderMAC.String())
}
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里如果没有 IP 冲突,continue ? 再尝试发送 arp ?


// When DAD(Duplicate Address Detection) is enanled, the kernel will check if this local link address is in conflict,
// this may take a while, set the maximum timeout to 10s
ndpReady := false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listen 之前,先 判断 先看 d.enableIPv6ConflictDetection ,是否要花这个时间

@@ -125,6 +128,35 @@ func CmdAdd(args *skel.CmdArgs) (err error) {
return err
}

// do ip conflict and gateway detection
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

后面这么多逻辑的前提,是要先判断 是否开启了冲突检测和网关检测

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在这个函数里面做了判断,也做到尽量提前返回

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我看那个函数里 还有做一大堆的 遍历 后 再 判断,这里依据简单的 if ,是否就 先 省了一些 流程

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

必须做遍历 分配的 IPConfigs 才能知道是否做检测,这一点我之前就考虑到了,已经做到尽量提前判断了

@@ -50,8 +53,8 @@ func CmdAdd(args *skel.CmdArgs) (err error) {
return fmt.Errorf("failed to load CNI network configuration: %v", err)
}

logger, err = setupFileLogging(conf)
if nil != err {
logger, err = SetupFileLogging(conf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果开启了冲突检测,第一步 就先把 网卡 up

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

开启冲突检测一定要等 IP 分配完之后才知道

Copy link
Collaborator

@weizhoublue weizhoublue Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那可以无论是否要开启,都 up 网卡,应该没毛病
这样 可能能节省 一半的时间

// CNI will set disable_ipv6 to 0 AFTER ipam is invoked
// in order to detect IPv6 address conflicts and gateway reachability
// we should ensure disable_ipv6 is 0
if err := sysctl.EnableIPv6ForInterfaces([]string{"lo", d.iface}); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个时间可能是不需要花费的,并且意义不大
如果系统本身 不能 ipv6 接口工作,有很多参数影响,不是简单这一个参数,主机上的网卡就不能 ipv6 工作
另外,如果有问题,后续 cni 也无法把 ip 配置上去

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这其实不是为 ipv6, 我是参考 sriov 的实践,它发送免费 arp/ndp 前 会等待 网卡就绪

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已移除

@cyclinder cyclinder force-pushed the ipam/dectetion branch 8 times, most recently from 1ed0bd2 to 189192d Compare January 25, 2025 00:48
@cyclinder cyclinder merged commit 655c389 into spidernet-io:main Jan 26, 2025
58 of 59 checks passed
@cyclinder cyclinder deleted the ipam/dectetion branch January 26, 2025 06:57
cyclinder added a commit to cyclinder/spiderpool that referenced this pull request Jan 26, 2025
cyclinder added a commit to cyclinder/spiderpool that referenced this pull request Jan 26, 2025
cyclinder added a commit that referenced this pull request Jan 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Coordinator does IP conflict gateway detection which can lead to potential communication problems.
2 participants