update maintenance tasks for galaxy
TimStewartJ committed Mar 2, 2021
1 parent df8e847 commit 442ef99
Showing 1 changed file with 14 additions and 15 deletions.
29 changes: 14 additions & 15 deletions docs/maintenance-tasks.md
@@ -13,29 +13,28 @@ any issues that it finds. After the scrub is finished,
it is good to also run `zpool status` to check if there
is anything wrong.

This command is run on a cronjob, so there should be no
need for manual intervention. The cronjob runs at 8:00AM
the first day of every month. Gravity will also send out an
This command was previously run on a cronjob. The cronjob ran at 8:00AM
the first day of every month. Gravity would also send out an
email at 8:10AM on the same day with the results. If the
scrub is fine, the email should be titled
scrub is fine, the email would be titled
`[All clear] Hen monthly ZFS report`. If the title
says `[POTENTIAL ZFS ISSUE]` instead, there may be
something wrong with the disk, and the email contains
the `zpool status` output which you can use to debug disk
issues. Details on how the cronjob is setup are in the
the `zpool status` output which can be used to debug disk
issues. Details on how the cronjob was set up are in the
private configuration repo, under `cronjob/monthly-zfs-report.py`.
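
The title logic described above can be sketched roughly as follows. This is not the contents of `cronjob/monthly-zfs-report.py`, which lives in the private repo; it is a minimal illustration that assumes the report is built from `zpool status -x`, which prints `all pools are healthy` when nothing is wrong. The function names are hypothetical, and only the `[POTENTIAL ZFS ISSUE]` prefix of the failure title comes from the text above.

```python
import subprocess

# `zpool status -x` prints this line when every pool is healthy.
HEALTHY_MARKER = "all pools are healthy"


def report_subject(status_output: str) -> str:
    """Pick the email subject from `zpool status -x` output (hypothetical helper)."""
    if HEALTHY_MARKER in status_output:
        return "[All clear] Hen monthly ZFS report"
    # Only the bracketed prefix is documented; the rest of this title is assumed.
    return "[POTENTIAL ZFS ISSUE] Hen monthly ZFS report"


def read_zpool_status() -> str:
    """Run `zpool status -x` and return its output; needs the ZFS tools installed."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit still produces output worth reporting
    )
    return result.stdout + result.stderr
```

On a host with ZFS installed, `report_subject(read_zpool_status())` would give the subject line, and the raw output can go in the email body for debugging.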

## Cluster control plane upgrade

The Kubernetes control plane should be upgraded regularly.
There is a cronjob sending out a triyearly reminder
There used to be a cronjob sending out a triyearly reminder
(Jan, May, Sept 1st of every year) reminding you to do the
upgrade. (The cronjob can be found in galaxy-control-repo.)

This must be done at least once a year, otherwise the
Kubernetes certificates may expire, which will
[break the entire cluster](/docs/Bare-Metal/troubleshooting/KubeadmCert.md#more-complex-solution-renewing-kubeadm-certificates).
The email sent out should contain the date on which the
The email sent out contains the date on which the
certificates expire. If you apply routine upgrades,
certificates should be renewed automatically and this
should not be an issue. If the certificates do expire,
@@ -47,19 +46,19 @@ cluster upgrades indefinitely.

### Upgrading the control plane

tl;dr: Follow https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ and upgrade the control plane, update apt packages on the nodes, then copy `/etc/kubernetes/admin.conf` on chick0 into `/home/spicy/.kube/config` on rooster.
tl;dr: Follow https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ and upgrade the control planes, update apt packages on the nodes, then copy `/etc/kubernetes/admin.conf` on a nebula into `/home/milky/.kube/config` on gravity.

Note: This will take a while and will cause some downtime. Be sure to notify users beforehand.

1. Follow the instructions in the [official cluster upgrade guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/). We only have one control plane node (chick0), so follow the "Upgrade the first control plane node" secion on chick0 and ignore "Upgrade additional control plane nodes". Also follow "Upgrade worker nodes" on all other chicks. This is also a good time to [upgrade all packages on the chicks](https://github.com/LibreTexts/metalc/blob/447a459bacfbc6a29d80229e7df2f2bfb953cd7a/docs/updating-ubuntu-kubelet.md) as well, since the chicks are cordoned during the upgrade.
2. Once you verified the cluster is working using `kubectl get nodes` on rooster, copy over the newer admin certificate/key.
   1. SSH into chick0, then do `sudo cp /etc/kubernetes/admin.conf /home/spicy/.kube/config`. Also `chown spicy:spicy /home/spicy/.kube/config` to make it readable to us.
   2. Go back into rooster and do `scp chick0:.kube/config ~/.kube/config` to copy the file onto rooster.
   3. Verify `kubectl `works on both rooster and chick0 by running any kubectl command (such as `kubectl get nodes`).
1. Follow the instructions in the [official cluster upgrade guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/). We have multiple control plane nodes (nebulas), so follow both the "Upgrade the first control plane node" and "Upgrade additional control plane nodes" sections. Also follow "Upgrade worker nodes" on all other worker nodes. This is also a good time to [upgrade all packages on the stars](https://github.com/LibreTexts/metalc/blob/447a459bacfbc6a29d80229e7df2f2bfb953cd7a/docs/updating-ubuntu-kubelet.md) as well, since the stars are cordoned during the upgrade.
2. Once you have verified the cluster is working using `kubectl get nodes` on gravity/quantum, copy over the newer admin certificate/key.
   1. SSH into a nebula, then do `sudo cp /etc/kubernetes/admin.conf /home/milky/.kube/config`. Also `sudo chown milky:milky /home/milky/.kube/config` to make it readable to us (the copy made with `sudo` is owned by root).
   2. Go back into gravity and do `scp nebula1:.kube/config ~/.kube/config` to copy the file onto gravity. Substitute whichever nebula you used in the previous step; `scp` needs a single source host here, so do not pass a brace list such as `nebula{1,5}`.
   3. Verify `kubectl` works on both gravity and the nebulas by running any kubectl command (such as `kubectl get nodes`).

### Renew certificates without upgrade

Sometimes you want to renew the certificates without doing a proper upgrade and causing downtime. In that case, do the following:

1. Follow [the official guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal). tl;dr: Just run `sudo kubeadm alpha certs renew` on chick0.
1. Follow [the official guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal). tl;dr: Just run `sudo kubeadm alpha certs renew all` on all of the nebulas (on kubeadm 1.20 and later, the command is `sudo kubeadm certs renew all`).
2. Follow the same step 2 as [Upgrading the control plane](#upgrading-the-control-plane).
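
When deciding whether a renewal is due, it can help to compute how many days a certificate has left. The sketch below is an illustration, not part of the cluster tooling: it assumes the expiry line comes from `openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt` (the default kubeadm certificate path), which prints a line of the form `notAfter=Mar  2 08:00:00 2022 GMT`. The function names and the 30-day threshold are hypothetical.

```python
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse an openssl `notAfter=...` line into a timezone-aware UTC datetime."""
    value = line.strip().split("=", 1)[1]
    if value.endswith(" GMT"):
        value = value[: -len(" GMT")]
    # openssl pads single-digit days with an extra space; strptime tolerates that.
    expiry = datetime.strptime(value, "%b %d %H:%M:%S %Y")
    return expiry.replace(tzinfo=timezone.utc)


def days_until_expiry(line: str, now: datetime) -> int:
    """Whole days between `now` and the certificate expiry (negative once expired)."""
    return (parse_not_after(line) - now).days


def needs_renewal(line: str, now: datetime, threshold_days: int = 30) -> bool:
    """True when the certificate expires within `threshold_days` (assumed threshold)."""
    return days_until_expiry(line, now) < threshold_days
```

For example, pass the openssl output line together with `datetime.now(timezone.utc)` to `needs_renewal` on each nebula before choosing between a renewal and a full upgrade.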
