The node-maintenance-operator (NMO) is an operator generated from the operator-sdk. NMO was previously developed under KubeVirt, and this repository is the up-to-date version of NMO.
The purpose of this operator is to watch for new or deleted custom resources (CRs) called NodeMaintenance, which indicate that a node in the cluster should either:
- CR created: move the node into maintenance, cordon the node (set it as unschedulable), and evict the pods that can be evicted from that node.
- CR deleted: remove the node from maintenance and uncordon the node (set it as schedulable).
Note: The current behavior of the operator is to mimic kubectl drain <node name>.
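For orientation, the manual steps that the operator mimics can be sketched as a small shell helper. This is only an approximation and not the operator's implementation: the flags --ignore-daemonsets and --force are assumptions inferred from the log messages shown in the usage section of this README, and the function names and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers approximating what a NodeMaintenance CR triggers.
# With DRY_RUN=1 (the default here) commands are printed, not executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Roughly what happens when the CR is created: cordon, then evict pods.
# kubectl drain cordons the node itself; the explicit cordon only makes
# the two steps visible.
enter_maintenance() {
  node="$1"
  run kubectl cordon "$node"
  # Assumed flags: skip DaemonSet-managed pods, allow unmanaged pods to be deleted.
  run kubectl drain "$node" --ignore-daemonsets --force
}

# Roughly what happens when the CR is deleted: make the node schedulable again.
exit_maintenance() {
  node="$1"
  run kubectl uncordon "$node"
}
```

Calling enter_maintenance node02 with the default DRY_RUN=1 prints the two kubectl commands without touching the cluster. According to the logs in this README, the operator also adds maintenance taints to the node, which this sketch does not do.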
There are three ways to run the operator:
- Deploy the latest version, which was built from the main branch, to a running OpenShift/Kubernetes cluster.
- Deploy the last release version from OperatorHub to a running Kubernetes cluster.
- Build and deploy from sources to a running or to-be-created OpenShift/Kubernetes cluster.
After every PR merge to the main branch, new images are built and pushed.
For deployment of NMO using these images you need:
- a running OpenShift cluster, or a Kubernetes cluster with Operator Lifecycle Manager (OLM) installed.
- the operator-sdk binary installed.
- a valid kubeconfig configured to access your cluster.
Then run operator-sdk run bundle
Click on Install in the Node Maintenance Operator page on OperatorHub, and follow its instructions to install the Operator Lifecycle Manager (OLM) and the operator.
Follow the instructions here for deploying the operator with OLM.
Note: The webhook cannot run when using make deploy, because the volume mount of the webserver certificate is not found.
To set maintenance on a node, a NodeMaintenance custom resource should be created.
The NodeMaintenance CR spec contains:
- nodeName: The name of the node which will be put into maintenance mode.
- reason: The reason why the node will be under maintenance.
Create the example NodeMaintenance CR found at config/samples/nodemaintenance_v1beta1_nodemaintenance.yaml:
$ cat config/samples/nodemaintenance_v1beta1_nodemaintenance.yaml
kind: NodeMaintenance
metadata:
  name: nodemaintenance-sample
spec:
  nodeName: node02
  reason: "Test node maintenance"
$ kubectl apply -f config/samples/nodemaintenance_v1beta1_nodemaintenance.yaml
$ kubectl logs <nmo-pod-name>
2022-02-23T07:33:58.924Z INFO controller-runtime.manager.controller.nodemaintenance Reconciling NodeMaintenance {"reconciler group": "", "reconciler kind": "NodeMaintenance", "name": "nodemaintenance-sample", "namespace": ""}
2022-02-23T07:33:59.266Z INFO controller-runtime.manager.controller.nodemaintenance Applying maintenance mode {"reconciler group": "", "reconciler kind": "NodeMaintenance", "name": "nodemaintenance-sample", "namespace": "", "node": "node02", "reason": "Test node maintenance"}
time="2022-02-24T11:58:20Z" level=info msg="Maintenance taints will be added to node node02"
time="2022-02-24T11:58:20Z" level=info msg="Applying taint add on Node: node02"
time="2022-02-24T11:58:20Z" level=info msg="Patching taints on Node: node02"
2022-02-23T07:33:59.336Z INFO controller-runtime.manager.controller.nodemaintenance Evict all Pods from Node {"reconciler group": "", "reconciler kind": "NodeMaintenance", "name": "nodemaintenance-sample", "namespace": "", "nodeName": "node02"}
E0223 07:33:59.498801 1 nodemaintenance_controller.go:449] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-jrprj, openshift-dns/dns-default-kf6jj, openshift-dns/node-resolver-72jzb, openshift-image-registry/node-ca-czgc6, openshift-ingress-canary/ingress-canary-44tgv, openshift-machine-config-operator/machine-config-daemon-csv6c, openshift-monitoring/node-exporter-rzwhz, openshift-multus/multus-additional-cni-plugins-829bh, openshift-multus/multus-qwfc9, openshift-multus/network-metrics-daemon-pxt6n, openshift-network-diagnostics/network-check-target-qqcbr, openshift-sdn/sdn-s5cqx; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-marketplace/nmo-downstream-8-8nms7
I0223 07:33:59.500418 1 nodemaintenance_controller.go:449] evicting pod openshift-network-diagnostics/network-check-source-865d4b5578-n2cxg
I0223 07:33:59.500790 1 nodemaintenance_controller.go:449] evicting pod openshift-ingress/router-default-7548cf6fb5-rgxrq
I0223 07:33:59.500944 1 nodemaintenance_controller.go:449] evicting pod openshift-marketplace/12a4cfa0c2be01867daf1d9b7ad7c0ae7a988fd957a2ad6df0d72ff6875lhcx
I0223 07:33:59.501061 1 nodemaintenance_controller.go:449] evicting pod openshift-marketplace/nmo-downstream-8-8nms7
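Instead of editing the sample file for each node, the CR can also be generated on the fly. The helper below is a minimal sketch: the function name nm_manifest, the maintenance-<node> naming scheme, and the default reason are hypothetical, and the apiVersion value is an assumption that should be checked against the sample file or the CRD installed on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: print a NodeMaintenance manifest for a given node.
# NM_API_VERSION is an assumption; confirm it against
# config/samples/nodemaintenance_v1beta1_nodemaintenance.yaml before use.
NM_API_VERSION="${NM_API_VERSION:-nodemaintenance.medik8s.io/v1beta1}"

nm_manifest() {
  node="$1"
  reason="${2:-Planned maintenance}"
  cat <<EOF
apiVersion: ${NM_API_VERSION}
kind: NodeMaintenance
metadata:
  name: maintenance-${node}
spec:
  nodeName: ${node}
  reason: "${reason}"
EOF
}

# Usage (requires cluster access):
#   nm_manifest node02 "Test node maintenance" | kubectl apply -f -
```

Printing the manifest to stdout keeps the helper free of side effects, so the output can be reviewed before it is piped to kubectl apply.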
To remove maintenance from a node, delete the corresponding NodeMaintenance CR (or nm, which is a shortName):
$ kubectl delete nm nodemaintenance-sample
"nodemaintenance-sample" deleted
$ kubectl logs <nmo-pod-name>
2022-02-24T14:27:35.332Z INFO controller-runtime.manager.controller.nodemaintenance Reconciling NodeMaintenance {"reconciler group": "", "reconciler kind": "NodeMaintenance", "name": "nodemaintenance-sample", "namespace": ""}
time="2022-02-24T14:27:35Z" level=info msg="Maintenance taints will be removed from node node02"
time="2022-02-24T14:27:35Z" level=info msg="Applying taint remove on Node: node02"
The NodeMaintenance CR can contain the following status fields:
$ kubectl get nm nodemaintenance-sample -o yaml
kind: NodeMaintenance
metadata:
  name: nodemaintenance-sample
spec:
  nodeName: node02
  reason: Test node maintenance
status:
  drainProgress: 40
  evictionPods: 5
  lastError: "Last failure message"
  lastUpdate: "2022-06-23T11:43:18Z"
  pendingPods:
  - pod-A
  - pod-B
  - pod-C
  phase: Running
  totalpods: 19
- drainProgress shows the percentage completion of draining the node.
- evictionPods is the total number of pods up for eviction, before the node entered maintenance mode.
- lastError represents the latest error, if any, for the latest reconciliation.
- lastUpdate is the last time the status was updated.
- pendingPods is a list of pods pending eviction.
- phase is the representation of the maintenance progress and can hold a string value of: Running|Succeeded. The phase is updated for each processing attempt on the CR.
- totalpods is the total number of pods, before the node entered maintenance mode.
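The status fields above can be used to wait for a node to finish draining. The following is a minimal sketch, assuming the phase is exposed at .status.phase as in the YAML output above; the function name wait_for_maintenance, its timeout and interval arguments, and the KUBECTL override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll a NodeMaintenance CR until status.phase is Succeeded.
# KUBECTL can be overridden (for example with oc); it defaults to kubectl.

wait_for_maintenance() {
  name="$1"
  timeout="${2:-300}"   # seconds before giving up
  interval="${3:-5}"    # seconds between polls
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    phase="$(${KUBECTL:-kubectl} get nm "$name" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Succeeded" ]; then
      echo "maintenance on $name succeeded"
      return 0
    fi
    echo "phase=${phase:-unknown}, waiting..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s waiting for $name" >&2
  return 1
}

# Usage (requires cluster access):
#   wait_for_maintenance nodemaintenance-sample 600 10
```

The function returns a non-zero exit code on timeout, so it can gate later steps in a script. If the wait times out, the lastError and pendingPods fields are the natural next place to look.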
Use NMO's must-gather from here to collect related debug data.
make check
- Deploy the operator as explained above.
- Run make cluster-functest.
For new minor releases:
- create and push the release branch.
- update OpenshiftCI with new branches!
For every major / minor / patch release:
- create and push the version tag.
- this should trigger CI to build and push new images.
- if it fails, the manual fallback is VERSION=x.y.z make container-build-and-push-community
- make the git tag a release in the GitHub UI.
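When the manual fallback is needed, a small guard can catch a mistyped version before the build starts. This is a sketch under the assumption that VERSION is always of the form x.y.z with numeric parts, as in the fallback command above; the function names is_release_version and manual_release are hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: accept only versions of the form x.y.z (digits only).
is_release_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Run the documented manual fallback only for a well-formed version.
manual_release() {
  version="$1"
  if ! is_release_version "$version"; then
    echo "invalid version: '$version' (expected x.y.z)" >&2
    return 1
  fi
  VERSION="$version" make container-build-and-push-community
}

# Usage (from the repository root):
#   manual_release 1.2.3
```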
Feel free to join our Google group to get more info.