Commit

Add details on GPU passthrough and add general maintenance procedure (#3150)

## Description

This PR updates the docs with the following changes:
1. Add details on GPU passthrough. This is migrated from infra-notes.
2. Add general maintenance procedure.


## Checklist
- [x] I have read and understood the [WATcloud Guidelines](https://cloud.watonomous.ca/docs/community-docs/watcloud/guidelines)
- [x] I have performed a self-review of my code
ben-z authored Sep 22, 2024
1 parent 1ead418 commit 95182af
Showing 3 changed files with 50 additions and 1 deletion.
14 changes: 13 additions & 1 deletion pages/docs/community-docs/watcloud/maintenance-manual.mdx
@@ -2,8 +2,20 @@

This manual outlines the maintenance procedures for various components of WATcloud.

## General procedure

1. **Plan the maintenance**: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
1. **Notify users**: If the maintenance will affect users, [notify them in advance](https://github.com/WATonomous/infrastructure-support/discussions). Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
1. **Perform the maintenance**: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
1. **Verify the maintenance**: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use [observability tools](./observability) to monitor the health of the system.
1. **Notify users**: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.

## SLURM

This section outlines the maintenance procedures for the SLURM cluster.

### Cluster overview

To get a general overview of the health of the SLURM cluster, you can run:

```bash copy
@@ -24,7 +36,7 @@ In the output above, `tr-slurm1` is in the `drained` state, which means it is no
`thor-slurm1` is in the `mix` state, which means some jobs are running on it.
All other nodes are in the `idle` state, which means there are no jobs running on them.

To get a detailed overview of the health of the SLURM cluster, you can run:
To get a detailed overview of nodes in the cluster, you can run:

```bash copy
scontrol show node [NODE_NAME]
36 changes: 36 additions & 0 deletions pages/docs/community-docs/watcloud/proxmox.mdx
@@ -1,4 +1,6 @@
import { Steps } from 'nextra/components'
import Picture from '@/components/picture'
import { DocProxmoxPrimaryGpu } from '@/build/fixtures/images'

# Proxmox

@@ -649,3 +651,37 @@ If you haven't already, please open a pull request (PR) and review the changes w
</Steps>


## GPU Passthrough

*This note is derived from https://github.com/WATonomous/infra-notes/blob/7786dfe4cd2af74e76530285e1141d06f5ab2df2/gpu-passthrough.md*

Passing through GPUs to VMs is a complex process that requires modifying many parts of the host system.
Fortunately, the official Proxmox documentation contains a guide on how to do this: [PCI Passthrough](https://pve.proxmox.com/wiki/PCI_Passthrough).
This guide should be sufficient; other resources (such as various forums) may contain outdated information or incorrect steps.
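
Before following the guide, it can help to check how the host groups its PCI devices, since a GPU is generally easiest to pass through when it does not share an IOMMU group with unrelated devices. The following is a minimal sketch, not part of the original notes: the function name `list_iommu_groups` is hypothetical, and it assumes a Linux host that exposes groups under `/sys/kernel/iommu_groups` (the directory can be overridden for testing).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the Proxmox guide): print one line per PCI
# device in the form "group <N>: <PCI address>".
# Assumes IOMMU groups are exposed under /sys/kernel/iommu_groups; pass a
# different root directory as the first argument to override.
list_iommu_groups() {
  local root="${1:-/sys/kernel/iommu_groups}"
  local group dev
  for group in "$root"/*; do
    # Skip the unexpanded glob (no groups) and anything that is not a group.
    [ -d "$group/devices" ] || continue
    for dev in "$group"/devices/*; do
      [ -e "$dev" ] || continue
      printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done
  done
}
```

Calling `list_iommu_groups` with no arguments on the Proxmox host prints the grouping; an empty result suggests that the IOMMU is not enabled or not exposed on that host.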

### Quirks

- When passing through a GPU, you may need to experiment with enabling or disabling "Primary GPU": <Picture alt="Primary GPU Setting" image={DocProxmoxPrimaryGpu} />
- "Primary GPU" appears to be required when passing a GTX 1080 through to Ubuntu VMs (otherwise `nvidia-smi` throws an error) or Windows VMs (otherwise we get Error 43). However, with "Primary GPU" enabled, we can no longer access the display from the Proxmox console. The setting doesn't appear to be required when passing through an RTX 3090 to a Windows VM.
- When a desktop GUI is installed and a display is plugged in, GPU passthrough stops working: the VM errors out the first time it is started and hangs the second time. Brief discussion [here](https://discord.com/channels/478659303167885314/571386709233893406/952007213742977044). The solution is to uninstall the desktop GUI or leave the display unplugged.

The following are some machine-specific quirks that we have encountered.

#### trpro (ASUS WRX80 Motherboard)

When a GPU is placed in the PCIe slot right beside the RAM, passthrough doesn't work. The errors mention an invalid VBIOS, followed by a message about being stuck in power state D3. When we manually pass the VM a `romfile`, following the instructions in the Proxmox PCI passthrough doc, the invalid VBIOS error goes away, but the device is still stuck in power state D3. ([Source](https://discord.com/channels/478659303167885314/580890419379044362/942267415071457311))

### PCIe bandwidth and GPUs

PCIe gen4 appears to require better signal quality than gen3.
When using PCIe risers at gen4 speeds, we encountered PCIe errors.
See [this thread](https://discord.com/channels/478659303167885314/1163159162432262264)
for more information.

We may be able to resolve this by replacing our current passive risers with retimers.
However, retimers are much more expensive than passive risers.
See [this discussion](https://github.com/WATonomous/infra-config/discussions/1938) for more information.

The current workaround is to set the link speed to gen3 or below.
This reduces the bandwidth available to the GPUs, but for many workloads
it does not substantially reduce performance.
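
To put the gen3 workaround in numbers, the sketch below estimates the theoretical one-directional link bandwidth from the per-lane transfer rates (8 GT/s for gen3, 16 GT/s for gen4) and the 128b/130b line encoding both generations use. The function name `pcie_bandwidth_gbps` is hypothetical, and the result ignores protocol overhead, so real throughput will be lower.

```bash
#!/usr/bin/env bash
# Hypothetical helper: theoretical one-directional PCIe bandwidth in GB/s.
# Usage: pcie_bandwidth_gbps <generation: 3|4> [lanes, default 16]
# Ignores protocol overhead; only the 128b/130b line encoding is included.
pcie_bandwidth_gbps() {
  local gen="$1" lanes="${2:-16}"
  awk -v gen="$gen" -v lanes="$lanes" 'BEGIN {
    rate[3] = 8    # gen3: 8 GT/s per lane
    rate[4] = 16   # gen4: 16 GT/s per lane
    if (!(gen in rate)) exit 1   # unsupported generation
    # GT/s * (128/130 encoding efficiency) / 8 bits per byte * lane count
    printf "%.2f\n", rate[gen] * 128 / 130 / 8 * lanes
  }'
}
```

For example, `pcie_bandwidth_gbps 3 16` gives about 15.75 GB/s and `pcie_bandwidth_gbps 4 16` about 31.51 GB/s, so dropping an x16 slot to gen3 halves the theoretical link bandwidth, leaving it equal to a gen4 x8 link.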
1 change: 1 addition & 0 deletions scripts/generate-assets.js
@@ -233,6 +233,7 @@ function generateTypescript(image_names) {
{ name: "computer-dark", uri: new WATcloudURI("watcloud://v1/sha256:7dac34046e20b4a5c4982d2a7940fdb313687d030b72a297adcd2a84d138e099?name=computer-dark.avif") },
{ name: "server-room-light", uri: new WATcloudURI("watcloud://v1/sha256:c3b72b5fb4c7bdff14f293782a98d7b1a21c7f2d6479cb1fa3b1b196a2179f73?name=server-room-light-min.jpg"), optimize: true},
{ name: "server-room-dark", uri: new WATcloudURI("watcloud://v1/sha256:216ca4fdc626b94daaad8a63be5c1a507f82abb2b3bed1839f6d0996ac3e84d2?name=server-room-dark-min.jpg"), optimize: true},
{ name: "doc-proxmox-primary-gpu", uri: new WATcloudURI("watcloud://v1/sha256:9b7b398205cf6508dce29f07023001baf5eebc287780d7220f50c6965da809ac?name=doc-proxmox-primary-gpu.png"), optimize: true},
];
await Promise.all(IMAGES.map(image => processImage(image)));
