Kubernetes Master Class: A Seamless Approach to Rancher & Kubernetes Upgrades
- Rancher Server is a set of pods that run the main orchestration engine and UI for Rancher.
- RKE (Rancher Kubernetes Engine) is the tool Rancher uses to create and manage Kubernetes clusters.
- Local/upstream cluster is the cluster where the Rancher server is installed; this is usually an RKE-built cluster.
- Downstream cluster(s) are the Kubernetes clusters that Rancher is managing.
The following are the high-level rules for planning a Rancher/Kubernetes/Docker upgrade.
- Do not rush an upgrade.
- Do not stack upgrades (we recommend at least 24 hours between upgrades).
- Make sure you have a good backup
- The recommended order of upgrades is Rancher, Kubernetes, and then Docker.
- All upgrades should be tested in a lab or non-prod environment before being deployed to Production.
- Review all release notes link
- Review the support matrix link
- It is not required, but we recommend pausing any CI/CD pipelines that use the Rancher API during an upgrade.
Please see the following recommendations when planning version upgrades.
- Rancher: perform one minor version jump at a time. For example, when upgrading from v2.1.x to v2.3.x, we encourage upgrading v2.1.x -> v2.2.x -> v2.3.x, but this is not required.
- Kubernetes: perform no more than two minor version jumps at a time, and ideally avoid skipping minor versions entirely, as accumulated changes increase the chance of an issue. For example, when upgrading from v1.13.x to v1.19.x, we encourage upgrading v1.13.x -> v1.15.x -> v1.17.x -> v1.19.x.
- RKE: perform one minor RKE version jump at a time. For example, when upgrading from v0.1.x to v1.1.x, instead do v0.1.x -> v0.2.x -> v0.3.x -> v1.0.x -> v1.1.x.
- Rancher upgrade - 30 minutes for the install, plus 30 minutes for a rollback
- Kubernetes upgrade - 60 minutes for the install (longer for larger clusters), plus 60 minutes for troubleshooting/rollback
- Rancher upgrade - Only the management of Rancher and downstream clusters is impacted; applications shouldn't notice that anything is being done, but any CI/CD pipelines should be paused.
- Kubernetes upgrade of the local cluster - The Rancher UI should disconnect and reconnect after a few minutes because the ingress controllers are restarted.
- Kubernetes upgrade of downstream clusters - Applications might see a short network blip as the ingress controllers and networking are restarted. See link for more details.
- Rancher upgrade - A maintenance window is not required, but CI/CD pipelines should be paused.
- Kubernetes upgrade of the local cluster - A maintenance window is not needed, but CI/CD pipelines should be paused.
- Kubernetes upgrade of downstream clusters - This should be done during a maintenance window or a quiet time
- Check if the Rancher UI is accessible
- Check if all clusters in UI are in an Active state
- Check if all pods in kube-system and cattle-system namespaces are running in both the local and downstream clusters.
kubectl get pods -n kube-system
kubectl get pods -n cattle-system
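If you want a single check that flags anything unhealthy in both namespaces, a small loop like the one below works; it is just a convenience sketch using plain kubectl and grep, not a Rancher-provided tool:
for ns in kube-system cattle-system; do
  echo "== $ns =="
  kubectl get pods -n "$ns" | grep -v 'Running\|Completed'
done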
- Verify etcd has scheduled snapshots configured, and these are working.
- RKE: if Rancher is deployed on a Kubernetes cluster built with RKE, verify etcd snapshots are enabled and working. On the etcd nodes, you can confirm with the following:
ls -l /opt/rke/etcd-snapshots
docker logs etcd-rolling-snapshots
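For reference, recurring snapshots are controlled by the etcd service settings in cluster.yml; the values below are only an illustrative sketch, not a recommendation:
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h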
- k3s: if Rancher is deployed on a k3s Kubernetes cluster, ensure scheduled backups are configured and working. Please see the k3s Documentation pages for further information on this.
- Create a one-time datastore snapshot; please see the Documentation for the RKE, k3s, and single-node Docker install options for more information
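For an RKE-built cluster, a one-time snapshot can be taken with the same tooling used later in this guide; this is a minimal sketch, the snapshot name is arbitrary, and it assumes your config file is named cluster.yaml:
rke etcd snapshot-save --config cluster.yaml --name pre-upgrade-manual-`date '+%Y%m%d%H%M%S'`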
- RKE: check for expired/expiring Kubernetes certs
for i in $(ls /etc/kubernetes/ssl/*.pem | grep -v key); do
  echo -n "$i "
  openssl x509 -startdate -enddate -noout -in "$i" | grep 'notAfter='
done
- Update helm repo cache
helm repo update
helm fetch rancher-stable/rancher
- Verify you’re connected to the correct cluster
kubectl get nodes -o wide
- Take an etcd snapshot
rke etcd snapshot-save --config cluster.yaml --name pre-rancher-upgrade-`date '+%Y%m%d%H%M%S'`
- Grab the current helm values using helm get values rancher -n cattle-system
Example output:
USER-SUPPLIED VALUES:
antiAffinity: required
auditLog:
  level: 2
hostname: rancher.example.com
ingress:
  tls:
    source: secret
- Use the values to build your upgrade command
NOTE: The only thing you should change is the version flag.
helm upgrade --install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --set auditLog.level=2 \
  --set antiAffinity=required \
  --version 2.5.5
- Wait for the upgrade to finish
kubectl -n cattle-system rollout status deploy/rancher
- Official Rancher upgrade Documentation
- Check if the Rancher UI is accessible
- Check if all clusters in UI are in an Active state
- Check if all pods in kube-system and cattle-system namespaces are running in both the local and downstream clusters.
kubectl get pods -n kube-system
kubectl get pods -n cattle-system
- Verify new Rancher version (Bottom Left corner)
- Verify the Rancher, cattle-cluster-agent, and cattle-node-agent pods are running the new version on the local cluster
kubectl get pods -n cattle-system -o wide
- Verify all downstream clusters are Active
- Verify the Rancher, cattle-cluster-agent, and cattle-node-agent pods are running the new version on all downstream clusters.
kubectl get pods -n cattle-system -o wide
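If scanning the wide pod listing is tedious, the running image tags can also be pulled directly. This is only a sketch and assumes the default object names Rancher creates (deployment rancher on the local cluster, deployment cattle-cluster-agent and daemonset cattle-node-agent on each cluster):
kubectl -n cattle-system get deploy rancher -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n cattle-system get deploy cattle-cluster-agent -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n cattle-system get ds cattle-node-agent -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'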
- Take a post-upgrade etcd snapshot
rke etcd snapshot-save --config cluster.yaml --name post-rancher-upgrade-`date '+%Y%m%d%H%M%S'`
- You cannot downgrade Rancher; you must do an etcd restore Documentation
rke etcd snapshot-restore --name pre-rancher-upgrade-..... --config ./cluster.yaml
- Verify you have the correct cluster.yaml and cluster.rkestate files
- Verify SSH access to all nodes in the cluster
- Verify all nodes are Ready
kubectl get nodes -o wide
- Verify all pods are Healthy
kubectl get pods --all-namespaces -o wide | grep -v 'Running\|Completed'
- We're looking for Pods crashing or stuck.
- Verify Kubernetes version is available in RKE
rke config --list-version --all --print
- You might need to upgrade to a newer RKE version if the recommended Kubernetes version isn't available.
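To see which RKE binary you are currently running before deciding whether it needs to be upgraded:
rke --version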
- Take an etcd snapshot
rke etcd snapshot-save --config cluster.yaml --name pre-k8s-upgrade-`date '+%Y%m%d%H%M%S'`
- Change kubernetes_version in the cluster.yaml
kubernetes_version: "1.19.7-rancher1-1"
- If you have an air-gapped setup, please see Documentation
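With kubernetes_version updated, the upgrade is applied the same way as any other RKE change, by running rke up against the cluster config; this sketch assumes the file is named cluster.yaml, as in the earlier snapshot commands:
rke up --config cluster.yaml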
- Verify all nodes are Ready and at the new version
- Verify all pods are Healthy
kubectl get pods --all-namespaces -o wide | grep -v 'Running\|Completed'
- All pods should be healthy; we're looking for Pods crashing or stuck.
- You cannot downgrade Kubernetes; you must do an etcd restore
rke etcd snapshot-restore --name pre-k8s-upgrade-..... --config ./cluster.yaml
- Build a standard RKE cluster Documentation
- Delete cluster.rkestate
- Delete kube_config_cluster.yml
- Run rke up
- You should see rke generating new certificates (See example output below)
INFO[0004] [certificates] Generating CA kubernetes certificates
INFO[0005] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0005] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0005] [certificates] Generating Kubernetes API server certificates
INFO[0006] [certificates] Generating Service account token key
INFO[0006] [certificates] Generating Kube Controller certificates
INFO[0006] [certificates] Generating Kube Scheduler certificates
INFO[0006] [certificates] Generating Kube Proxy certificates
INFO[0007] [certificates] Generating Node certificate
INFO[0007] [certificates] Generating admin certificates and kubeconfig
- SSH to one of the controlplane nodes
- Run the script and follow the instructions given to get a kubeconfig file for the cluster.
- Run the script and follow the instructions given to get a cluster.yaml and cluster.rkestate file for the cluster.
- Copy the files cluster.yml, cluster.rkestate, and kube_config_cluster.yml to a safe location.
- Build a standard RKE cluster Documentation
- Set up helm2
kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --wait
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
- Install Rancher using helm2
helm install rancher-latest/rancher --name rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --version 2.3.10
- Set up helm3
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update
- Try to upgrade Rancher
helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --version 2.5.5
- Error message
Release "rancher" does not exist. Installing it now. Error: rendered manifests contain a resource that already exists. Unable to continue with install: ServiceAccount "rancher" in namespace "cattle-system" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "rancher"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "cattle-system"
- Take an etcd snapshot
rke etcd snapshot-save --config cluster.yaml --name helm2-helm3-`date '+%Y%m%d%H%M%S'`
- Update annotations and labels for the Rancher objects
kubectl annotate --overwrite namespace cattle-system app.kubernetes.io/managed-by=helm
kubectl annotate --overwrite namespace cattle-system meta.helm.sh/release-name=rancher
kubectl annotate --overwrite namespace cattle-system meta.helm.sh/release-namespace=cattle-system
kubectl label --overwrite namespace cattle-system app.kubernetes.io/managed-by=Helm
kubectl -n cattle-system annotate --overwrite sa rancher app.kubernetes.io/managed-by=helm
kubectl -n cattle-system annotate --overwrite sa rancher meta.helm.sh/release-name=rancher
kubectl -n cattle-system annotate --overwrite sa rancher meta.helm.sh/release-namespace=cattle-system
kubectl -n cattle-system label --overwrite sa rancher app.kubernetes.io/managed-by=Helm
kubectl -n cattle-system annotate --overwrite ClusterRoleBinding rancher app.kubernetes.io/managed-by=helm
kubectl -n cattle-system annotate --overwrite ClusterRoleBinding rancher meta.helm.sh/release-name=rancher
kubectl -n cattle-system annotate --overwrite ClusterRoleBinding rancher meta.helm.sh/release-namespace=cattle-system
kubectl -n cattle-system label --overwrite ClusterRoleBinding rancher app.kubernetes.io/managed-by=Helm
kubectl -n cattle-system annotate --overwrite service rancher app.kubernetes.io/managed-by=helm
kubectl -n cattle-system annotate --overwrite service rancher meta.helm.sh/release-name=rancher
kubectl -n cattle-system annotate --overwrite service rancher meta.helm.sh/release-namespace=cattle-system
kubectl -n cattle-system label --overwrite service rancher app.kubernetes.io/managed-by=Helm
kubectl -n cattle-system annotate --overwrite Deployment rancher app.kubernetes.io/managed-by=helm
kubectl -n cattle-system annotate --overwrite Deployment rancher meta.helm.sh/release-name=rancher
kubectl -n cattle-system annotate --overwrite Deployment rancher meta.helm.sh/release-namespace=cattle-system
kubectl -n cattle-system label --overwrite Deployment rancher app.kubernetes.io/managed-by=Helm
kubectl -n cattle-system annotate --overwrite Ingress rancher app.kubernetes.io/managed-by=helm
kubectl -n cattle-system annotate --overwrite Ingress rancher meta.helm.sh/release-name=rancher
kubectl -n cattle-system annotate --overwrite Ingress rancher meta.helm.sh/release-namespace=cattle-system
kubectl -n cattle-system label --overwrite Ingress rancher app.kubernetes.io/managed-by=Helm
- Upgrade Rancher
helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --version 2.5.5
- Build a standard RKE cluster Documentation
- Install Rancher using helm3
kubectl create namespace cattle-system
helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --set antiAffinity=required \
  --version 2.5.5
- Run systemctl stop docker on one of the nodes in the cluster
- Error message
NAME               STATUS     ROLES                      AGE     VERSION
mmattox-lab-c-01   Ready      controlplane,etcd,worker   9m22s   v1.19.7
mmattox-lab-c-02   Ready      controlplane,etcd,worker   9m22s   v1.19.7
mmattox-lab-c-03   NotReady   controlplane,etcd,worker   9m22s   v1.19.7
kubectl get pods -n cattle-system -o wide | grep -ve 'Running\|Completed'
NAME                       READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
rancher-7df6ff577b-nqjfv   0/1     Pending   0          2m11s   <none>   <none>   <none>           <none>
NOTE: This should only be done if the node is unrecoverable, and a replacement node should be added to the cluster ASAP.
- Upgrading Rancher
- Change replicas to 2 for the Rancher deployment
helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=secret \
  --set replicas=2 \
  --version 2.5.5
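To confirm the change took effect before continuing, you can check the replica count; this assumes the default deployment name rancher:
kubectl -n cattle-system get deploy rancher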
- Upgrading Kubernetes
- Edit cluster.yml
- Comment out the bad node (see the sketch below). NOTE: You should only remove one node at a time.
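For illustration, commenting out the failed node in the nodes section of cluster.yml would look something like this sketch; the hostnames follow the example output above, and your roles, users, and other node options will differ:
nodes:
  - address: mmattox-lab-c-01
    role: [controlplane, etcd, worker]
  - address: mmattox-lab-c-02
    role: [controlplane, etcd, worker]
# mmattox-lab-c-03 is the failed node, commented out one node at a time
#  - address: mmattox-lab-c-03
#    role: [controlplane, etcd, worker]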
- Run rke up
- Delete the node from the cluster if RKE left it behind
kubectl delete node mmattox-lab-c-03