diff --git a/doc/design/parallel-node-config.md b/doc/design/parallel-node-config.md
new file mode 100644
index 000000000..60b9eee0e
--- /dev/null
+++ b/doc/design/parallel-node-config.md
@@ -0,0 +1,245 @@
---
title: Parallel SR-IOV configuration
authors:
  - e0ne
reviewers:
  - adrianchiris
  - SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
Allow the SR-IOV Network Operator to configure more than one node at a time.

## Motivation
The SR-IOV Network Operator configures SR-IOV one node at a time and one NIC at a time. That means it can take hours
or even days to configure all NICs on large cluster deployments. With multi-NIC deployments the operator configures
NICs one by one on each node, which leads to a lot of unnecessary node drain calls during SR-IOV configuration. This
proposal also moves all draining logic to a centralized place, which reduces the chance of the race conditions and
bugs that were previously encountered in sriov-network-config-daemon during draining.

### Use Cases

### Goals
* The number of drainable nodes should be 1 by default
* The number of drainable nodes should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools. In this case the pool should be
  selected by its priority, as is implemented for SriovNetworkNodePolicy
* Move all drain-related logic into a centralized place


### Non-Goals
Parallel NIC configuration on the same node is out of the scope of this proposal.

## Proposal

Introduce a node pool drain configuration and a controller to meet the goals above.


### Workflow Description
A new Drain controller will be introduced to manage the node drain and cordon procedures. That means the config daemon
no longer needs to perform the drain itself or hold a `drain lock`. The overall drain process will be covered by the
following states:

```golang
DrainIdle        = "Idle"
DrainRequired    = "Drain_Required"
DrainMcpPaused   = "Draining_MCP_Paused"
Draining         = "Draining"
DrainingComplete = "Draining_Complete"
```

The Drain controller will watch for node annotation, `SriovNetworkNodeState` and `SriovNetworkPoolConfig` changes:
```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworkpoolconfigs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch;update;patch
```

Once `SriovNetworkNodeState.DrainStatus` is marked as `Drain_Required` by the config daemon, the controller will start
the draining procedure. The reconcile loop will check whether the current count of `Draining` nodes is below
`maxUnavailable` and, if so, mark the node as `Draining` to start the draining procedure. If the current count of
`Draining` nodes already equals `maxUnavailable`, no `SriovNetworkNodeState` updates will be applied in the current
reconcile loop.

Once draining and configuration are finished, the config daemon will update `SriovNetworkNodeState.DrainStatus` to
`Idle`.

We can add a webhook to block lowering `maxUnavailable` while nodes are draining. Without a validating webhook, the
controller will continue the drain procedure for the nodes that are already `Draining` until their count drops to the
specified `maxUnavailable` number.

We need this logic in the controller to keep it centralized and to avoid race conditions during
`SriovNetworkNodeState` updates.
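
To make the gating rule above concrete, here is a minimal sketch of the per-pool selection logic the reconcile loop could apply. The package name and the helpers `canStartDraining` and `nextDrainCandidates` are hypothetical, not part of the operator's code, and ordering/fairness between candidate nodes is deliberately left out:

```golang
package drain

// Drain states as defined in the workflow description.
const (
	DrainIdle        = "Idle"
	DrainRequired    = "Drain_Required"
	Draining         = "Draining"
	DrainingComplete = "Draining_Complete"
)

// canStartDraining reports whether one more node may move from
// Drain_Required to Draining without exceeding the pool's budget.
// maxUnavailable == 0 means "no limit".
func canStartDraining(drainingCount, maxUnavailable int) bool {
	if maxUnavailable == 0 {
		return true
	}
	return drainingCount < maxUnavailable
}

// nextDrainCandidates picks which Drain_Required nodes can be marked
// Draining in the current reconcile loop, given the drain states of all
// nodes in the pool keyed by node name. Go map iteration order is random;
// a real controller would order candidates deterministically.
func nextDrainCandidates(states map[string]string, maxUnavailable int) []string {
	draining := 0
	for _, s := range states {
		if s == Draining {
			draining++
		}
	}
	var candidates []string
	for node, s := range states {
		if s != DrainRequired {
			continue
		}
		if !canStartDraining(draining, maxUnavailable) {
			break
		}
		candidates = append(candidates, node)
		draining++
	}
	return candidates
}
```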
The config daemon will be responsible only for setting `SriovNetworkNodeState.DrainStatus=Drain_Required` and
`SriovNetworkNodeState.DrainStatus=Draining_Complete`, which will simplify its implementation.

### API Extensions

#### Extend existing CR SriovNetworkNodeState
SriovNetworkNodeState already contains the current state of the configuration progress in its `status` field, so we
don't need to store the drain status in a node annotation. `SriovNetworkNodeStateStatus` will be extended to contain
the drain progress instead:

```golang
// SriovNetworkNodeStateStatus defines the observed state of SriovNetworkNodeState
type SriovNetworkNodeStateStatus struct {
	Interfaces InterfaceExts `json:"interfaces,omitempty"`
	// +kubebuilder:validation:Enum=Idle;Drain_Required;Draining_MCP_Paused;Draining;Draining_Complete
	DrainStatus   string `json:"drainStatus,omitempty"`
	SyncStatus    string `json:"syncStatus,omitempty"`
	LastSyncError string `json:"lastSyncError,omitempty"`
}
```

#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift to provide configuration for OVS Hardware Offloading. We
can extend it to add configuration for the drain pool, e.g.:

```golang
// DrainConfigSpec defines node pool drain configuration
type DrainConfigSpec struct {
	// MaxUnavailable is the number of nodes that can be configured in parallel.
	// 0 means no limit, all nodes will be configured in parallel
	// +kubebuilder:default:=1
	// +kubebuilder:validation:Minimum:=0
	MaxUnavailable int `json:"maxUnavailable,omitempty"`
}

// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=99
	// Priority of the SriovNetworkPoolConfig; higher priority policies can override lower ones.
	Priority int `json:"priority,omitempty"`
	// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
	OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
	// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
	NodeSelectorTerms []v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
	DrainConfig DrainConfigSpec `json:"drainConfig,omitempty"`
}
```

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  drainConfig:
    maxUnavailable: 5
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-2
    - matchExpressions:
        - key: other-label
          operator: "Exists"
  ovsHardwareOffloadConfig: {}
```

A default pool should always be created with the lowest priority:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 0
  drainConfig:
    maxUnavailable: 1
```
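
Because a node may match the selectors of several pools, the controller needs a deterministic rule to pick exactly one pool per node, as stated in the goals. A minimal sketch of that selection follows; the `Pool` type and the `poolForNode` helper are simplified, hypothetical stand-ins for the real CR types:

```golang
package drain

// Pool is a simplified stand-in for SriovNetworkPoolConfig used in this sketch.
type Pool struct {
	Name           string
	Priority       int // 0 is the lowest priority
	MaxUnavailable int
	Matches        func(nodeLabels map[string]string) bool // node selector evaluation
}

// poolForNode picks, among the pools whose selector matches the node, the one
// with the highest priority. Ties are broken by name to keep the result
// deterministic. The default pool (priority 0, matching all nodes) guarantees
// there is always a match.
func poolForNode(nodeLabels map[string]string, pools []Pool) *Pool {
	var selected *Pool
	for i := range pools {
		p := &pools[i]
		if !p.Matches(nodeLabels) {
			continue
		}
		if selected == nil ||
			p.Priority > selected.Priority ||
			(p.Priority == selected.Priority && p.Name < selected.Name) {
			selected = p
		}
	}
	return selected
}
```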
### Implementation Details/Notes/Constraints
The implementation of this feature is complex and requires changes to many parts of the SR-IOV Network Operator.
To avoid introducing breaking changes we have to split this effort into several phases:
* implement the new Drain controller without user-facing API changes:
  it will process only one node configuration at a time and doesn't require API changes
* drop the `sriovnetwork.openshift.io/state` node annotation and move its values into the `SriovNetworkNodeState`
  object
* introduce the new API changes to support draining pools of nodes in parallel:
  in this phase we introduce the new fields in SriovNetworkPoolConfig and modify the Drain controller to watch
  the specified CRs and proceed with the drain in parallel according to the per-pool configuration

All phases should be implemented one by one in separate PRs, in the order above.

### Upgrade & Downgrade considerations
After an operator upgrade we have to support migration of the `sriovnetwork.openshift.io/state` node annotation to
`SriovNetworkNodeState.DrainStatus`. This logic will be implemented in the Drain controller and should be supported
at least until the next major version of the operator.

The node reconcile loop will check for the `sriovnetwork.openshift.io/state` annotation and move its value to the
`SriovNetworkNodeState.DrainStatus` field. The `sriovnetwork.openshift.io/state` annotation will be deleted by the
operator once `SriovNetworkNodeState` is updated.

### Alternative APIs
#### Option 1: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include the drain pools configuration, e.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
    - name: default
      maxUnavailable: 1
      priority: 0 # the lowest priority
    - name: pool-1
      maxUnavailable: 5
      priority: 44
      # empty nodeSelectorTerms means 'all nodes'
      nodeSelectorTerms:
        - matchExpressions:
            - key: some-label
              operator: In
              values:
                - val-1
                - val-2
        - matchExpressions:
            - key: other-label
              operator: "Exists"
```

#### Option 2: New CRD
Add a new `SriovDrainConfig` CRD with the fields mentioned in the previous option, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 0 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-1
```

### Test Plan
* Unit tests will be implemented for the new Drain controller.
* E2E, manual or automated functional testing should cover such test cases as:
  * verify that we actually configure SR-IOV on `maxUnavailable` nodes at the same time
  * check that we don't configure more than `maxUnavailable` nodes at the same time
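
As a sketch of the unit-test level, a table-driven test of the hypothetical `canStartDraining` helper from the workflow sketch above could look like this (illustrative only, not the operator's actual test code):

```golang
package drain

import "testing"

// TestCanStartDraining exercises the maxUnavailable gating rule from the
// Workflow Description section: a node may start draining only while the
// number of currently draining nodes is below the pool's budget.
func TestCanStartDraining(t *testing.T) {
	cases := []struct {
		name           string
		draining       int
		maxUnavailable int
		want           bool
	}{
		{"under budget", 0, 1, true},
		{"at budget", 1, 1, false},
		{"over budget after lowering maxUnavailable", 5, 3, false},
		{"zero means unlimited", 100, 0, true},
	}
	for _, c := range cases {
		if got := canStartDraining(c.draining, c.maxUnavailable); got != c.want {
			t.Errorf("%s: canStartDraining(%d, %d) = %v, want %v",
				c.name, c.draining, c.maxUnavailable, got, c.want)
		}
	}
}
```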