From 95182afe51fdb2a42e98e272fcb99ce8da2019d3 Mon Sep 17 00:00:00 2001
From: Ben Zhang
Date: Sat, 21 Sep 2024 20:15:52 -0700
Subject: [PATCH] Add details on GPU passthrough and add general maintenance procedure (#3150)

## Description

This PR updates the docs with the following changes:

1. Add details on GPU passthrough. This is migrated from infra-notes.
2. Add general maintenance procedure.

## Checklist

- [x] I have read and understood the [WATcloud Guidelines](https://cloud.watonomous.ca/docs/community-docs/watcloud/guidelines)
- [x] I have performed a self-review of my code

---
 .../watcloud/maintenance-manual.mdx           | 14 +++++++-
 .../docs/community-docs/watcloud/proxmox.mdx  | 36 +++++++++++++++++++
 scripts/generate-assets.js                    |  1 +
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/pages/docs/community-docs/watcloud/maintenance-manual.mdx b/pages/docs/community-docs/watcloud/maintenance-manual.mdx
index 550860f..f5fab59 100644
--- a/pages/docs/community-docs/watcloud/maintenance-manual.mdx
+++ b/pages/docs/community-docs/watcloud/maintenance-manual.mdx
@@ -2,8 +2,20 @@
 This manual outlines the maintenance procedures for various components of WATcloud.
 
+## General procedure
+
+1. **Plan the maintenance**: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
+1. **Notify users**: If the maintenance will affect users, [notify them in advance](https://github.com/WATonomous/infrastructure-support/discussions). Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
+1. **Perform the maintenance**: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
+1. **Verify the maintenance**: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use [observability tools](./observability) to monitor the health of the system.
+1. **Notify users**: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.
+
 ## SLURM
 
+This section outlines the maintenance procedures for the SLURM cluster.
+
+### Cluster overview
+
 To get a general overview of the health of the SLURM cluster, you can run:
 
 ```bash copy
@@ -24,7 +36,7 @@ In the output above, `tr-slurm1` is in the `drained` state, which means it is no
 `thor-slurm1` is in the `mix` state, which means some jobs are running on it.
 All other nodes are in the `idle` state, which means there are no jobs running on them.
 
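As an editorial illustration (not part of the original patch), the sketch below shows one way the general procedure above could be applied to a SLURM node before and after maintenance. The node name `tr-slurm1` is taken from the example output above, and the commands assume SLURM administrator privileges on the cluster.

```bash
# Drain the node: running jobs are allowed to finish, but no new jobs are scheduled on it.
scontrol update NodeName=tr-slurm1 State=DRAIN Reason="scheduled maintenance"

# List drained/down nodes together with the recorded reason, to confirm the node is draining.
sinfo -R

# After the maintenance has been verified, return the node to service.
scontrol update NodeName=tr-slurm1 State=RESUME
```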
-To get a detailed overview of the health of the SLURM cluster, you can run:
+To get a detailed overview of nodes in the cluster, you can run:
 
 ```bash copy
 scontrol show node [NODE_NAME]
diff --git a/pages/docs/community-docs/watcloud/proxmox.mdx b/pages/docs/community-docs/watcloud/proxmox.mdx
index 49a04aa..09c085a 100644
--- a/pages/docs/community-docs/watcloud/proxmox.mdx
+++ b/pages/docs/community-docs/watcloud/proxmox.mdx
@@ -1,4 +1,6 @@
 import { Steps } from 'nextra/components'
+import Picture from '@/components/picture'
+import { DocProxmoxPrimaryGpu } from '@/build/fixtures/images'
 
 # Proxmox
 
@@ -649,3 +651,37 @@ If you haven't already, please open a pull request (PR) and review the changes w
+## GPU Passthrough
+
+*This note is derived from https://github.com/WATonomous/infra-notes/blob/7786dfe4cd2af74e76530285e1141d06f5ab2df2/gpu-passthrough.md*
+
+Passing through GPUs to VMs is a complex process that requires modifying many parts of the host system.
+Fortunately, the official Proxmox documentation contains a guide on how to do this: [PCI Passthrough](https://pve.proxmox.com/wiki/PCI_Passthrough).
+This guide should be sufficient on its own; other resources (such as various forums) may provide outdated information or incorrect steps.
+
+### Quirks
+
+- When passing through a GPU, you may need to experiment with whether or not to enable "Primary GPU" (see the configuration sketch at the end of this note):
+  - "Primary GPU" appears to be required when passing a GTX 1080 through to Ubuntu VMs (otherwise `nvidia-smi` throws an error) or Windows VMs (otherwise we get Error 43). However, with "Primary GPU" enabled, the display is no longer accessible from the Proxmox console. It doesn't appear to be required when passing through an RTX 3090 to a Windows VM.
+- When a desktop GUI is installed on the host and a display is plugged in, GPU passthrough stops working: the VM errors out the first time it is started, then hangs the second time. Brief discussion [here](https://discord.com/channels/478659303167885314/571386709233893406/952007213742977044). The solution is to uninstall the desktop GUI or leave the display unplugged.
+
+The following are some machine-specific quirks that we have encountered.
+
+#### trpro (ASUS WRX80 Motherboard)
+
+When a GPU is placed in the PCIe slot right beside the RAM, passthrough doesn't work. The error mentions an invalid VBIOS, followed by a message about the device being stuck in power state D3. When we manually pass the VM a `romfile` according to the instructions in the Proxmox PCI passthrough doc, the invalid VBIOS error goes away, but the device remains stuck in power state D3. ([Source](https://discord.com/channels/478659303167885314/580890419379044362/942267415071457311))
+
+### PCIe bandwidth and GPUs
+
+PCIe gen4 appears to require higher signal quality than gen3.
+When using PCIe risers at gen4 speeds, we encountered PCIe errors.
+See [this thread](https://discord.com/channels/478659303167885314/1163159162432262264)
+for more info on this.
+
+We may be able to resolve this by replacing our current passive risers with retimers.
+See [this discussion](https://github.com/WATonomous/infra-config/discussions/1938) for more information.
+However, retimers are much more expensive than passive risers.
+
+The current workaround is to set the link speed to gen3 or below.
+This reduces the bandwidth available to the GPUs, but for many workloads,
+this doesn't substantially decrease performance.
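As an editorial illustration (not part of the original patch), the sketch below shows roughly how the options discussed above map onto the `qm` CLI on the Proxmox host, and how the negotiated PCIe link speed can be checked. The VM ID `100` and the PCI address `01:00` are hypothetical placeholders, and limiting the slot to gen3 is assumed to be done in the host BIOS.

```bash
# Pass the GPU at (hypothetical) host address 01:00 through to VM 100 as a PCIe device.
# "x-vga=on" is the config-level equivalent of the "Primary GPU" checkbox in the web UI.
qm set 100 -hostpci0 01:00,pcie=on,x-vga=on

# If a VBIOS image is needed (see the trpro quirk above), a romfile placed in
# /usr/share/kvm/ can be supplied in addition:
# qm set 100 -hostpci0 01:00,pcie=on,x-vga=on,romfile=vbios.rom

# Check the supported and negotiated PCIe link speeds for the GPU on the host,
# e.g. to confirm that the slot is actually running at gen3 after the BIOS change.
lspci -vv -s 01:00.0 | grep -E "LnkCap|LnkSta"
```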
diff --git a/scripts/generate-assets.js b/scripts/generate-assets.js index 44be00a..6ad8250 100644 --- a/scripts/generate-assets.js +++ b/scripts/generate-assets.js @@ -233,6 +233,7 @@ function generateTypescript(image_names) { { name: "computer-dark", uri: new WATcloudURI("watcloud://v1/sha256:7dac34046e20b4a5c4982d2a7940fdb313687d030b72a297adcd2a84d138e099?name=computer-dark.avif") }, { name: "server-room-light", uri: new WATcloudURI("watcloud://v1/sha256:c3b72b5fb4c7bdff14f293782a98d7b1a21c7f2d6479cb1fa3b1b196a2179f73?name=server-room-light-min.jpg"), optimize: true}, { name: "server-room-dark", uri: new WATcloudURI("watcloud://v1/sha256:216ca4fdc626b94daaad8a63be5c1a507f82abb2b3bed1839f6d0996ac3e84d2?name=server-room-dark-min.jpg"), optimize: true}, + { name: "doc-proxmox-primary-gpu", uri: new WATcloudURI("watcloud://v1/sha256:9b7b398205cf6508dce29f07023001baf5eebc287780d7220f50c6965da809ac?name=doc-proxmox-primary-gpu.png"), optimize: true}, ]; await Promise.all(IMAGES.map(image => processImage(image)));