Fixed manifest conflicts, needs rebuild
wtripp180901 committed Jul 6, 2023
2 parents ada1e0b + 315c7bc commit d3a7241
Showing 24 changed files with 68 additions and 124 deletions.
7 changes: 0 additions & 7 deletions .github/workflows/build-containers.yml
@@ -48,19 +48,12 @@ jobs:
uses: docker/build-push-action@v4
with:
provenance: false
#context: ./images/
push: true
tags: ${{ steps.image-meta.outputs.tags }}
labels: ${{ steps.image-meta.outputs.labels }}
cache-from: type=local,src=/tmp/.buildx-cache
cache-to: type=local,dest=/tmp/.buildx-cache-new,mode=max

#- name: Verify push to GHCR
# run: docker inspect ${{ fromJSON(steps.image-meta.outputs.json).tags[1] }}

#- name: Test image by running help
# run: singularity exec docker://${{ fromJSON(steps.image-meta.outputs.json).tags[1] }} /io500 -h

# Temp fix
# https://github.com/docker/build-push-action/issues/252
# https://github.com/moby/buildkit/issues/1896
8 changes: 3 additions & 5 deletions Dockerfile
@@ -84,12 +84,10 @@ RUN mkdir /etc/sysconfig/slurm \
/var/lib/slurmd/assoc_usage \
/var/lib/slurmd/qos_usage \
/var/lib/slurmd/fed_mgr_state \
&& groupadd -r --gid=990 slurm \
&& useradd -r -g slurm --uid=990 slurm \
&& useradd -r --uid=990 slurm \
&& chown -R slurm:slurm /var/*/slurm* \
&& groupadd --gid=1000 rocky \
&& useradd -g rocky --uid=1000 rocky \
&& usermod -p '*' rocky
&& useradd -u 1000 rocky \
&& usermod -p '*' rocky # unlocks account but sets no password

VOLUME /etc/slurm
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
31 changes: 28 additions & 3 deletions README.md
@@ -24,18 +24,24 @@ The Helm chart will run the following containers:
* slurmctld
* slurmd (2 replicas by default)

The compose file will create the following named volumes:
The Helm chart will create the following named volumes:

* nfs-server-volume ( -> /home )
* var_lib_mysql ( -> /var/lib/mysql )

## Configuring the Cluster

All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. The `authorized_keys` file contains authorised public keys for the user `rocky`, add your public key to access the cluster. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster").
Additional parameters can be found in the `values.yaml` file, which will be applied on a Helm chart deployment. Note that some of these values, such as `encodedMungeKey` will also not propagate until the cluster is restarted (see Reconfiguring the Cluster).
Additional parameters can be found in the `values.yaml` file, which will be applied on a Helm chart deployment. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").

## Deploying the Cluster

On initial deployment ONLY, run
```console
./generate-secrets.sh
```
This generates a set of secrets. If these need to be regenerated, see "Reconfiguring the Cluster"

After configuring `kubectl` with the appropriate `kubeconfig` file, deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
@@ -99,8 +105,27 @@ echo $SLURM_JOB_ID: $SLURM_JOB_NODELIST
Note: The mpirun script assumes you are running as user 'rocky'. If you are running as root, you will need to include the --allow-run-as-root argument
## Reconfiguring the Cluster

### Changes to config files

To guarantee changes to config files are propagated to the cluster, use
```console
kubectl rollout restart deployment <deployment-names>
```
Generally restarts to `slurmd`, `slurmctld`, `login` and `slurmdbd` will be required
Generally restarts to `slurmd`, `slurmctld`, `login` and `slurmdbd` will be required

### Changes to secrets

Regenerate secrets by rerunning
```console
./generate-secrets.sh
```
Some secrets are persisted in volumes, so cycling them requires a full teardown and reboot of the volumes and pods which these volumes are mounted on. Run
```console
kubectl delete deployment mysql
kubectl delete pvc var-lib-mysql
helm upgrade <deployment-name> slurm-cluster-chart
```
and then restart the other dependent deployments to propagate changes:
```console
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```
11 changes: 5 additions & 6 deletions docker-entrypoint.sh
@@ -2,7 +2,7 @@
set -euo pipefail

cp /tempmounts/munge.key /etc/munge/munge.key
chown 998:998 /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 600 /etc/munge/munge.key

if [ "$1" = "slurmdbd" ]
@@ -72,16 +72,15 @@ fi
if [ "$1" = "login" ]
then

mkdir /home/rocky || true
mkdir /home/rocky/.ssh || true
mkdir -p /home/rocky/.ssh
cp tempmounts/authorized_keys /home/rocky/.ssh/authorized_keys

echo "---> Setting permissions for user home directories"
cd /home
for DIR in */;
do USER=$( echo $DIR | sed "s/.$//" ) && (chown -R $USER:$USER $USER || echo "Failed to take ownership of $USER") \
&& (chmod 700 /home/$USER/.ssh || echo "Couldn't set permissions for .ssh directory for $USER") \
&& (chmod 600 /home/$USER/.ssh/authorized_keys || echo "Couldn't set permissions for .ssh/authorized_keys for $USER");
do USER_TO_SET=$( echo $DIR | sed "s/.$//" ) && (chown -R $USER_TO_SET:$USER_TO_SET $USER_TO_SET || echo "Failed to take ownership of $USER_TO_SET") \
&& (chmod 700 /home/$USER_TO_SET/.ssh || echo "Couldn't set permissions for .ssh directory for $USER_TO_SET") \
&& (chmod 600 /home/$USER_TO_SET/.ssh/authorized_keys || echo "Couldn't set permissions for .ssh/authorized_keys for $USER_TO_SET");
done
echo "---> Complete"
echo "Starting sshd"
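The loop in this hunk derives each user name by stripping the trailing slash from the `*/` glob with `sed`. A minimal sketch of the same step using bash parameter expansion with quoted variables follows, assuming the home root contains only per-user directories. `list_home_users` is a hypothetical helper for illustration and is not part of `docker-entrypoint.sh`.

```shell
#!/bin/bash
# Sketch: list the user names implied by the directories under a home root.
# list_home_users is a hypothetical helper, not part of docker-entrypoint.sh.
list_home_users() {
    local home_root="$1" dir
    for dir in "$home_root"/*/; do
        [ -d "$dir" ] || continue    # skip the literal pattern when nothing matches
        dir="${dir%/}"               # drop the trailing slash (the sed "s/.$//" step)
        printf '%s\n' "${dir##*/}"   # keep only the final path component
    done
}
```

Called as `list_home_users /home`, this prints one name per line, which a caller could feed into the `chown` and `chmod` steps. Quoting the expansions keeps directory names containing spaces intact.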
13 changes: 13 additions & 0 deletions generate-secrets.sh
@@ -0,0 +1,13 @@
#!/bin/bash

kubectl create secret generic database-auth-secret \
--dry-run=client \
--from-literal=password=$(tr -dc 'A-Za-z0-9' </dev/urandom | head -c 32) \
-o yaml | \
kubectl apply -f -

kubectl create secret generic munge-key-secret \
--dry-run=client \
--from-literal=munge.key=$(dd if=/dev/urandom bs=1 count=1024 2>/dev/null | base64 -w 0) \
-o yaml | \
kubectl apply -f -
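
The database password above is produced by filtering `/dev/urandom` through `tr` and truncating with `head -c 32`. The sketch below is a variant of that step, not the script's own code: it reads a bounded chunk of random bytes, so no stage of the pipeline is terminated early by a closed pipe, which can matter if the script is ever run under `set -o pipefail`. `gen_password` and the 4096-byte chunk size are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: generate an alphanumeric password of a given length (default 32).
# gen_password is a hypothetical helper, not part of generate-secrets.sh.
gen_password() {
    local length="${1:-32}" chars
    # 4096 random bytes yield roughly 1000 alphanumeric characters on average,
    # far more than the lengths this sketch is meant for.
    chars=$(head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')
    # Truncate to the requested length with bash substring expansion.
    printf '%s' "${chars:0:length}"
}
```

A caller could pass the result to `--from-literal=password="$(gen_password)"` in place of the inline pipeline.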
3 changes: 2 additions & 1 deletion slurm-cluster-chart/templates/authorized-keys-configmap.yaml
@@ -4,4 +4,5 @@ metadata:
name: {{ .Values.configmaps.authorizedKeys }}
data:
authorized_keys: |
{{- .Files.Get "files/authorized_keys" | nindent 4 -}}
{{- .Files.Get "files/authorized_keys" | nindent 4 -}}
9 changes: 0 additions & 9 deletions slurm-cluster-chart/templates/database-auth-secret.yaml

This file was deleted.

8 changes: 2 additions & 6 deletions slurm-cluster-chart/templates/login-deployment.yaml
@@ -5,15 +5,13 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: login
app.kubernetes.io/part-of: slurm-docker-cluster
name: login
spec:
replicas: {{ .Values.replicas.login }}
selector:
matchLabels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: login
app.kubernetes.io/part-of: slurm-docker-cluster
strategy:
type: Recreate
template:
@@ -22,7 +20,6 @@ spec:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: login
app.kubernetes.io/part-of: slurm-docker-cluster
spec:
containers:
- args:
@@ -32,7 +29,7 @@ spec:
ports:
- containerPort: 22
volumeMounts:
- mountPath: {{ .Values.sharedDirectory.mountPath }}
- mountPath: {{ .Values.nfs.mountPath }}
name: slurm-jobdir
- mountPath: /etc/slurm/slurm.conf
name: slurm-config-volume
@@ -53,7 +50,7 @@ spec:
- name: slurm-jobdir
nfs:
server: {{ .Values.nfs.server }}
path: {{ .Values.nfs.path }}
path: {{ .Values.nfs.exportPath }}
- name: slurm-config-volume
configMap:
name: {{ .Values.configmaps.slurmConf }}
@@ -66,4 +63,3 @@ spec:
- name: authorized-keys
configMap:
name: {{ .Values.configmaps.authorizedKeys }}
status: {}
5 changes: 1 addition & 4 deletions slurm-cluster-chart/templates/login-service.yaml
@@ -5,7 +5,6 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: login
app.kubernetes.io/part-of: slurm-docker-cluster
name: login
spec:
ports:
@@ -16,6 +15,4 @@ spec:
selector:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: login
app.kubernetes.io/part-of: slurm-docker-cluster
status:
loadBalancer: {}

6 changes: 0 additions & 6 deletions slurm-cluster-chart/templates/munge-key-secret.yaml

This file was deleted.

1 change: 0 additions & 1 deletion slurm-cluster-chart/templates/mysql-deployment.yaml
@@ -48,4 +48,3 @@ spec:
- name: var-lib-mysql
persistentVolumeClaim:
claimName: var-lib-mysql
status: {}
2 changes: 0 additions & 2 deletions slurm-cluster-chart/templates/mysql-service.yaml
@@ -14,5 +14,3 @@ spec:
selector:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: mysql
status:
loadBalancer: {}
16 changes: 1 addition & 15 deletions slurm-cluster-chart/templates/nfs.yaml
@@ -43,18 +43,4 @@ spec:
- name: storage
persistentVolumeClaim:
claimName: nfs-server-volume
#---
#apiVersion: v1
#kind: Service
#metadata:
# name: nfs-service
#spec:
# ports:
# - name: nfs
# port: 2049
# - name: mountd
# port: 20048
# - name: rpcbind
# port: 111
# selector:
# app: nfs-server # must match with the label of NFS pod

3 changes: 2 additions & 1 deletion slurm-cluster-chart/templates/slurm-conf-configmap.yaml
@@ -4,4 +4,5 @@ metadata:
name: {{ .Values.configmaps.slurmConf }}
data:
slurm.conf: |
{{- .Files.Get "files/slurm.conf" | nindent 4 -}}
{{- .Files.Get "files/slurm.conf" | nindent 4 -}}
13 changes: 2 additions & 11 deletions slurm-cluster-chart/templates/slurmctld-deployment.yaml
@@ -5,15 +5,13 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmctld
app.kubernetes.io/part-of: slurm-docker-cluster
name: slurmctld
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmctld
app.kubernetes.io/part-of: slurm-docker-cluster
strategy:
type: Recreate
template:
@@ -22,7 +20,6 @@ spec:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmctld
app.kubernetes.io/part-of: slurm-docker-cluster
spec:
containers:
- args:
@@ -33,10 +30,8 @@ spec:
- containerPort: 6817
resources: {}
volumeMounts:
- mountPath: {{ .Values.sharedDirectory.mountPath }}
- mountPath: {{ .Values.nfs.mountPath }}
name: slurm-jobdir
#- mountPath: /var/log/slurm
# name: var-log-slurm
- mountPath: /etc/slurm/slurm.conf
name: slurm-config-volume
subPath: slurm.conf
@@ -49,14 +44,10 @@ spec:
- name: slurm-jobdir
nfs:
server: {{ .Values.nfs.server }}
path: {{ .Values.nfs.path }}
#- name: var-log-slurm
# persistentVolumeClaim:
# claimName: var-log-slurm
path: {{ .Values.nfs.exportPath }}
- name: slurm-config-volume
configMap:
name: {{ .Values.configmaps.slurmConf }}
- name: munge-key-secret
secret:
secretName: {{ .Values.secrets.mungeKey }}
status: {}
4 changes: 0 additions & 4 deletions slurm-cluster-chart/templates/slurmctld-service.yaml
@@ -5,7 +5,6 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmctld
app.kubernetes.io/part-of: slurm-docker-cluster
name: slurmctld
spec:
ports:
@@ -15,6 +14,3 @@ spec:
selector:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmctld
app.kubernetes.io/part-of: slurm-docker-cluster
status:
loadBalancer: {}
8 changes: 2 additions & 6 deletions slurm-cluster-chart/templates/slurmd-deployment.yaml
@@ -5,15 +5,13 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmd
app.kubernetes.io/part-of: slurm-docker-cluster
name: slurmd
spec:
replicas: {{ .Values.replicas.slurmd }}
selector:
matchLabels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmd
app.kubernetes.io/part-of: slurm-docker-cluster
strategy:
type: Recreate
template:
@@ -22,7 +20,6 @@ spec:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmd
app.kubernetes.io/part-of: slurm-docker-cluster
spec:
containers:
- args:
@@ -36,7 +33,7 @@ spec:
- mountPath: /etc/slurm/slurm.conf
name: slurm-config-volume
subPath: slurm.conf
- mountPath: {{ .Values.sharedDirectory.mountPath }}
- mountPath: {{ .Values.nfs.mountPath }}
name: slurm-jobdir
- mountPath: /tempmounts/munge.key
name: munge-key-secret
@@ -46,11 +43,10 @@ spec:
- name: slurm-jobdir
nfs:
server: {{ .Values.nfs.server }}
path: {{ .Values.nfs.path }}
path: {{ .Values.nfs.exportPath }}
- name: slurm-config-volume
configMap:
name: {{ .Values.configmaps.slurmConf }}
- name: munge-key-secret
secret:
secretName: {{ .Values.secrets.mungeKey }}
status: {}
4 changes: 0 additions & 4 deletions slurm-cluster-chart/templates/slurmd-service.yaml
@@ -5,7 +5,6 @@ metadata:
labels:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmd
app.kubernetes.io/part-of: slurm-docker-cluster
name: slurmd
spec:
ports:
@@ -15,6 +14,3 @@ spec:
selector:
app.kubernetes.io/name: slurm
app.kubernetes.io/component: slurmd
app.kubernetes.io/part-of: slurm-docker-cluster
status:
loadBalancer: {}