
Add Slurm autoscaling #18

Draft
wants to merge 175 commits into
base: refactor/review-helm
Commits
648db22
Added Open Ondemand to image
wtripp180901 Jul 14, 2023
b241c36
Running ood portal generator
wtripp180901 Jul 14, 2023
1995fd9
Trying adding ood user before starts
wtripp180901 Jul 14, 2023
26a4750
Apache runs but auth errors
wtripp180901 Jul 17, 2023
6abcad0
Creating htpasswd file and adding user on startup
wtripp180901 Jul 17, 2023
494a7a5
Now adds rocky as authenticated user and uses htdbm to generate auth …
wtripp180901 Jul 17, 2023
547428b
Updated image + mounted cluster config
wtripp180901 Jul 17, 2023
a1bd370
Trying creating shell directory on startup
wtripp180901 Jul 17, 2023
ee321c9
Trying adding env file to shell directory
wtripp180901 Jul 17, 2023
d48976b
Bump values.yaml
wtripp180901 Jul 17, 2023
e3b8774
Trying installing modules in Dockerfile
wtripp180901 Jul 18, 2023
2172d7b
Trying to configure clusters (not working)
wtripp180901 Jul 18, 2023
3f86fbe
Trying entrypoint tweaks
wtripp180901 Jul 18, 2023
7c541b0
Trying to configure cluster with the login nodes
wtripp180901 Jul 18, 2023
c24c181
Image now sets up rocky OOD password with env variable from secret
wtripp180901 Jul 18, 2023
ad79e16
Rocky OOD password now set as secret from generate-secrets.sh
wtripp180901 Jul 18, 2023
2655d12
Fixed broken mountpath for cluster config
wtripp180901 Jul 19, 2023
44e71b4
Fixed incorrect slurm binaries path
wtripp180901 Jul 19, 2023
804c74d
Updated docs
wtripp180901 Jul 19, 2023
0e2666a
Changed image to allow self-sshing
wtripp180901 Jul 19, 2023
7513b72
Fixed incorrect path
wtripp180901 Jul 19, 2023
4ba0991
Added newline to avoid breaking authorized_keys file
wtripp180901 Jul 19, 2023
833b0d2
Bumped image
wtripp180901 Jul 19, 2023
d38e241
Removed host key generation from login image
wtripp180901 Jul 19, 2023
a89e584
Updated image to copy and set permissions for host keys from mount
wtripp180901 Jul 19, 2023
a6c8e38
Server now has persistent set of host keys from mount
wtripp180901 Jul 19, 2023
7a2480b
Removed comments
wtripp180901 Jul 19, 2023
1345a58
Added https (fixes job composer)
wtripp180901 Jul 19, 2023
0f286ed
Now generates keys for rocky to self-ssh if don't already exist (in i…
wtripp180901 Jul 20, 2023
c094754
Updated image tag
wtripp180901 Jul 20, 2023
3daa29f
Fixed helm merge conflicts
wtripp180901 Jul 20, 2023
a5b71c2
Updated image after merge
wtripp180901 Jul 20, 2023
56c57ef
add kubectl to slurmctl /etc/slurm
sjpb Jul 25, 2023
bfb4770
Merge branch 'refactor/review-helm' into feat/autoscale
sjpb Jul 25, 2023
5705d43
make slurmd resource a template for use from slurmctld pod
sjpb Jul 25, 2023
bf903fc
fix hook mungekey
sjpb Jul 25, 2023
f23c1d7
WIP - autoscale programs and config
sjpb Jul 25, 2023
e2a8041
try to force rebuild with autoscale scripts
sjpb Jul 25, 2023
5b3dc25
try to build image with autoscale scripts
sjpb Jul 25, 2023
85fe5c5
try to build with autoscale scripts
sjpb Jul 25, 2023
740d2d4
try to build with autoscale scripts
sjpb Jul 25, 2023
d96e182
try to build with autoscale scripts
sjpb Jul 25, 2023
456b5aa
try to build with autoscale scripts
sjpb Jul 25, 2023
4d93a5e
install kubectl in image
sjpb Jul 25, 2023
bca49f9
rename kubectl to kubeconfig
sjpb Jul 26, 2023
68b77e5
fix kubernetes repo
sjpb Jul 26, 2023
b130100
move docker build into directory
sjpb Jul 26, 2023
a459fd8
change ownership of kubeconfig
sjpb Jul 26, 2023
2d9cb4c
fix workflow path filter for image build
sjpb Jul 26, 2023
055f0d2
Merge branch 'refactor/review-helm' into feat/autoscale
sjpb Jul 26, 2023
d487326
move kubeconfig out of /etc/slurm volume
sjpb Jul 26, 2023
800ab39
move non-secrets to projected /etc/slurm volume on slurmctld, use /va…
sjpb Jul 26, 2023
bf4ec14
fix perms on slurm secrets dir
sjpb Jul 26, 2023
ecd9cb0
resume/suspend programs write logs to directory with correct permissions
sjpb Jul 26, 2023
d8eeb38
fix duplicate SlurmctldParameters
sjpb Jul 26, 2023
f80b5a8
fix paths for resume/suspend programs
sjpb Jul 26, 2023
79eaf11
bump image
sjpb Jul 26, 2023
688885b
pass slurmd flags via container args
sjpb Jul 27, 2023
c95b5e4
bump image
sjpb Jul 27, 2023
a9b7d4a
pass options to all slurm daemons via container args, set to max debu…
sjpb Jul 27, 2023
3d393f0
bump image
sjpb Jul 27, 2023
f2c222e
add h/w definition for nodes
sjpb Jul 27, 2023
a908255
use reboot flag on slurmd start to make resume work
sjpb Jul 27, 2023
0963ea9
fix NFS-mounted /home permissions
sjpb Jul 27, 2023
348c7ea
bump image
sjpb Jul 27, 2023
4dba961
remove cpu definition from slurm.conf
sjpb Jul 27, 2023
07b2502
Merge branch 'refactor/review-helm' into feat/autoscale
sjpb Jul 27, 2023
1af9d08
bump image
sjpb Jul 27, 2023
d933207
don't use DNS for nodes
sjpb Jul 27, 2023
b8b7d48
use host network for slurmd
sjpb Jul 27, 2023
186db3c
add hostPort to slurmd pods to avoid multiple on one k8s-node
sjpb Jul 27, 2023
416eccd
don't default to 1x CPU
sjpb Jul 28, 2023
06fd904
add back in noaddrcache
sjpb Jul 28, 2023
2771460
Merge pull request #20 from stackhpc/feat/autoscale-hostnetwork
sjpb Jul 28, 2023
9391fdc
remove commented-out topology constraints on slurmd
sjpb Jul 28, 2023
7fd2796
Changed hook to drain nodes before checking for jobs
wtripp180901 Aug 8, 2023
18be119
Updated tag and docs
wtripp180901 Aug 8, 2023
d2531af
Tweaks + now undrains rather than resuming drained nodes
wtripp180901 Aug 8, 2023
856d837
Update tag
wtripp180901 Aug 8, 2023
1c8f39a
Merge pull request #17 from stackhpc/refactor/review-helm
wtripp180901 Aug 8, 2023
2c59b39
Rebuild for merge
wtripp180901 Aug 8, 2023
540ed62
Updated tag
wtripp180901 Aug 8, 2023
845584b
Added entrypoint for post-upgrade hook
wtripp180901 Aug 8, 2023
7a70ba3
Added post-upgrade hook to undrain nodes
wtripp180901 Aug 8, 2023
9e4598e
Merge branch 'main' into feat/autoscale
sjpb Aug 8, 2023
057651a
bump image
sjpb Aug 8, 2023
2b6b2de
Rebuilding image after merge
wtripp180901 Aug 8, 2023
f52e918
Fixed munge
wtripp180901 Aug 8, 2023
303e6f0
Updated tag
wtripp180901 Aug 8, 2023
7ca0668
Moved database auth to helm templating
wtripp180901 Aug 10, 2023
656aa6c
Moved munge key generation to helm
wtripp180901 Aug 10, 2023
a9003f7
Moved OOD password to values.yaml
wtripp180901 Aug 10, 2023
b427120
Random secrets now generated pre-install only
wtripp180901 Aug 10, 2023
e0514f6
Added kubectl to image
wtripp180901 Aug 10, 2023
6244767
Fixed Dockerfile
wtripp180901 Aug 10, 2023
cd0d1af
Testing with separate command
wtripp180901 Aug 10, 2023
4993605
Revert "Testing with separate command"
wtripp180901 Aug 10, 2023
2914c9b
Removed sudo from dockerfile
wtripp180901 Aug 10, 2023
2ace96f
Moved kubernetes repo to separate file
wtripp180901 Aug 10, 2023
763de73
Fixed leftover commands
wtripp180901 Aug 10, 2023
336ec8c
Updated tag and created service account to modify host-keys-secret
wtripp180901 Aug 10, 2023
d58f819
Added entrypoint for host key generation hook
wtripp180901 Aug 10, 2023
16ee05d
Added pre-install hook to generate host keys
wtripp180901 Aug 10, 2023
15b07a6
Removed generate-secrets.sh
wtripp180901 Aug 10, 2023
4b8e114
Now option to give public key explicitly through values.yaml
wtripp180901 Aug 11, 2023
c7a7248
Added custom packaging to workflow
wtripp180901 Aug 11, 2023
69122f7
Trying adding charts to cr packages
wtripp180901 Aug 11, 2023
ca27405
Added source in slurm-cluster-chart/files/httpd.conf
wtripp180901 Aug 11, 2023
1a3c3ad
Added source in slurm-cluster-chart/files/ood_portal.yaml
wtripp180901 Aug 11, 2023
09d2512
Add Known Issues heading to start documenting these
Aug 11, 2023
9979627
Convert Rook NFS to Helm chart
Aug 11, 2023
a4727da
Removed quotes
wtripp180901 Aug 11, 2023
62c6f34
Testing without env file for shell
wtripp180901 Aug 11, 2023
4d90e24
Moved rocky ssh generation to make purpose clearer
wtripp180901 Aug 11, 2023
1a4a3e4
Updated tag
wtripp180901 Aug 11, 2023
edfdd7c
Fix storageClassName templating typo
Aug 11, 2023
4407fbe
Remove broken subPath spec
Aug 11, 2023
f9d4f9a
Changed OOD key names
wtripp180901 Aug 14, 2023
2ac2fd5
Working Helm chart publisher workflow (#25)
wtripp180901 Aug 14, 2023
f25fe6e
Removed resource policies
wtripp180901 Aug 14, 2023
af39470
Fix typo
sd109 Aug 14, 2023
336f95f
Remove yaml anchor
sd109 Aug 14, 2023
5f12196
Remove anchor ref and add explanatory comment
sd109 Aug 14, 2023
350d39b
Add yaml anchor explanation
sd109 Aug 14, 2023
58a89d4
Add comment about name constraints
sd109 Aug 14, 2023
474450b
Refactored and documented values.yaml
wtripp180901 Aug 14, 2023
8c4407c
Merge pull request #15 from stackhpc/ood
wtripp180901 Aug 15, 2023
908f808
Add namespace as command line arg
Aug 14, 2023
925ad80
Add namespace as script arg
Aug 15, 2023
e6c5275
Now gives ownership to rocky after keygen
wtripp180901 Aug 15, 2023
f32b4f1
Fix dnsConfig namespace
Aug 15, 2023
7c0e2d9
Fixed path
wtripp180901 Aug 15, 2023
171010d
Updated values.yaml
wtripp180901 Aug 15, 2023
584acc4
Merge pull request #26 from stackhpc/hotfix/key-permissions
wtripp180901 Aug 15, 2023
a33790b
Use builtin Helm optional dependency feature
Aug 15, 2023
f86952f
Separate Rook cleanup into correct chart
Aug 15, 2023
1371681
Update docs
Aug 15, 2023
fe58891
Make backing RWO storage class configurable
Aug 15, 2023
303d156
Mention storage capacity config
Aug 15, 2023
1debded
Add note on target namespace
Aug 15, 2023
d465983
Merge branch 'main' into feature/helm-install-nfs
Aug 15, 2023
8818a94
Revert to randomly generated DB password
Aug 15, 2023
6dc8566
Conditionally include backing storage class field
Aug 15, 2023
4c7f875
Changed database template name
wtripp180901 Aug 15, 2023
50e7285
Punctuation
sd109 Aug 16, 2023
729e43c
Clarify namespace arg as optional
Aug 16, 2023
43a5dd7
Re-disable line wrapping
Aug 16, 2023
9cde995
Merge pull request #23 from stackhpc/feature/helm-install-nfs
sjpb Aug 16, 2023
d3daba4
Merge image rebuild
wtripp180901 Aug 16, 2023
7c5b6c4
Updated image
wtripp180901 Aug 16, 2023
e839442
Merge pull request #24 from stackhpc/azimuth-helm
sd109 Aug 16, 2023
968515e
Replaced kubeconfig mount with ServiceAccount
wtripp180901 Aug 16, 2023
e25332e
Added debug to k8s files
wtripp180901 Aug 17, 2023
d313063
only permit one slurmd pod per k8s node
sjpb Aug 17, 2023
e90f227
Added more debugging for k8s
wtripp180901 Aug 17, 2023
6530f78
use host networking
sjpb Aug 17, 2023
f5c1261
Sending debug to log files
wtripp180901 Aug 17, 2023
10b8e8e
Adding kubectl output to logs
wtripp180901 Aug 17, 2023
ef184aa
Adding error check
wtripp180901 Aug 17, 2023
a0193a6
Merge pull request #28 from stackhpc/feat/hostport
sjpb Aug 17, 2023
4d4a15b
Merge branch 'main' into feat/hostnetwork
sjpb Aug 17, 2023
a9ea92b
Adding /dev/tty pipes
wtripp180901 Aug 17, 2023
be00d24
Debug
wtripp180901 Aug 17, 2023
63795d3
Added error redirection
wtripp180901 Aug 17, 2023
def4a77
Merge pull request #29 from stackhpc/feat/hostnetwork
sjpb Aug 17, 2023
a731c60
Fixed missing environment variables in power up/down scripts
wtripp180901 Aug 17, 2023
a2ca5e3
Updated values.yaml and gave all pod permissions to account
wtripp180901 Aug 17, 2023
96d933f
Merge pull request #30 from stackhpc/feat/autoscaler-service-account
sjpb Aug 17, 2023
1f51003
Rebuilding image with fixed merged conflicts
wtripp180901 Aug 18, 2023
89981e6
Updated image tag
wtripp180901 Aug 18, 2023
a0a2323
Merge pull request #21 from stackhpc/hook-race-fix
wtripp180901 Aug 18, 2023
6ca2cd0
Image rebuild with fixed merge conflicts
wtripp180901 Aug 18, 2023
3ebcfe4
Updated image
wtripp180901 Aug 18, 2023
0602876
Pre-merge image rebuild
wtripp180901 Aug 18, 2023
344b9b2
Updated image tag
wtripp180901 Aug 18, 2023
47 changes: 18 additions & 29 deletions .github/workflows/publish-helm-chart.yml
@@ -1,37 +1,26 @@
name: Release Charts

on:
push:
branches:
- main

name: Publish charts
# Run the tasks on every push
on: push
jobs:
release:
# depending on default permission settings for your org (contents being read-only or read-write for workloads), you will have to add permissions
# see: https://docs.github.com/en/actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
contents: write
publish_charts:
name: Build and push Helm charts
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Check out the repository
uses: actions/checkout@v2
with:
# This is important for the semver action to work correctly
# when determining the number of commits since the last tag
fetch-depth: 0
submodules: true

- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "[email protected]"

- name: Install Helm
uses: azure/setup-helm@v3
env:
GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
- name: Get SemVer version for current commit
id: semver
uses: stackhpc/github-actions/semver@master

- name: Run chart-releaser
uses: helm/[email protected]
- name: Publish Helm charts
uses: stackhpc/github-actions/helm-publish@master
with:
charts_dir: .
env:
CR_TOKEN: "${{ secrets.GITHUB_TOKEN }}"

token: ${{ secrets.GITHUB_TOKEN }}
version: ${{ steps.semver.outputs.version }}
app-version: ${{ steps.semver.outputs.short-sha }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
# Build artifacts from local helm install
slurm-cluster-chart/Chart.lock
slurm-cluster-chart/charts/
38 changes: 24 additions & 14 deletions README.md
@@ -1,8 +1,7 @@
# Slurm Docker Cluster

This is a multi-container Slurm cluster using Kubernetes. The Helm chart
creates a named volume for persistent storage of MySQL data files as well as
an NFS volume for shared storage.
This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.

## Dependencies

@@ -27,51 +26,59 @@ The Helm chart will create the following named volumes:

* var_lib_mysql ( -> /var/lib/mysql )

A named ReadWriteMany (RWX) volume mounted to `/home` is also expected, this can be external or can be deployed using the scripts in the `/nfs` directory (See "Deploying the Cluster")
A named ReadWriteMany (RWX) volume mounted to `/home` is also expected; this can be external or can be deployed using the provided `rooknfs` chart directory (see "Deploying the Cluster").

## Configuring the Cluster

All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster").
Additional parameters can be found in the `values.yaml` file, which will be applied on a Helm chart deployment. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").
All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster"). Additional parameters can be found in the `values.yaml` file for the Helm chart. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").

## Deploying the Cluster

### Generating Cluster Secrets

On initial deployment ONLY, run
```console
./generate-secrets.sh
./generate-secrets.sh [<target-namespace>]
```
This generates a set of secrets. If these need to be regenerated, see "Reconfiguring the Cluster"
This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster".

Be sure to take note of the Open OnDemand credentials; you will need them to access the cluster through a browser.

### Connecting RWX Volume

A ReadWriteMany (RWX) volume is required, if a named volume exists, set `nfs.claimName` in the `values.yaml` file to its name. If not, manifests to deploy a Rook NFS volume are provided in the `/nfs` directory. You can deploy this by running
```console
./nfs/deploy-nfs.sh
```
and leaving `nfs.claimName` as the provided value.
A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart in order to provide a RWX capable Storage Class for the required shared volume. If the target Kubernetes cluster has an existing storage class which should be used instead, then `storageClass` in `values.yaml` should be set to the name of this existing class and the RookNFS dependency should be disabled by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured by setting the value of `storage.capacity`.
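As a sketch, the storage options described above might look like this in `values.yaml` when pointing the chart at an existing storage class instead of RookNFS (the exact key nesting and the class name `managed-nfs` are assumptions, not taken from the chart):

```yaml
# Hypothetical values.yaml fragment: disable the bundled RookNFS
# dependency and use an existing RWX-capable storage class instead.
rooknfs:
  enabled: false           # skip installing the RookNFS subchart
storageClass: managed-nfs  # name of an existing RWX storage class (illustrative)
storage:
  capacity: 20Gi           # size of the provisioned shared volume (illustrative)
```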

See the separate RookNFS chart [values.yaml](./rooknfs/values.yaml) for further configuration options when using the RookNFS to provide the shared storage volume.

### Supplying Public Keys

To access the cluster via `ssh`, you will need to make your public keys available. All your public keys from localhost can be added by running

```console
./publish-keys.sh
./publish-keys.sh [<target-namespace>]
```
where `<target-namespace>` is the namespace in which the Slurm cluster chart will be deployed (i.e. using `helm install -n <target-namespace> ...`). This will create a Kubernetes Secret in the appropriate namespace for the Slurm cluster to use. Omitting the namespace arg will install the secrets in the default namespace.
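For example, preparing keys for a deployment in an illustrative namespace named `slurm-test` might look like:

```console
./publish-keys.sh slurm-test
helm install -n slurm-test <deployment-name> slurm-cluster-chart
```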

### Deploying with Helm

After configuring `kubectl` with the appropriate `kubeconfig` file, deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
```

NOTE: If using the RookNFS dependency, then the following must be run before installing the Slurm cluster chart
```console
helm dependency update slurm-cluster-chart
```

Subsequent releases can be deployed using:

```console
helm upgrade <deployment-name> slurm-cluster-chart
```

Note: When updating the cluster with `helm upgrade`, a pre-upgrade hook will prevent the upgrade if there are running jobs in the Slurm queue. Attempting an upgrade sets all Slurm nodes to the `DRAINED` state. If an upgrade fails due to running jobs, either wait for the jobs to complete and retry the upgrade, or undrain the nodes manually by accessing the cluster as a privileged user. Alternatively, you can bypass the hook by running `helm upgrade` with the `--no-hooks` flag (this may result in running jobs being lost).
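If nodes are left drained after a failed upgrade, a manual recovery might look like the following, run from inside the cluster as a privileged user (commands sketched from standard Slurm usage, not from this repo's docs):

```console
sinfo                                       # confirm which nodes show state "drain"
scontrol update NodeName=all State=UNDRAIN  # return drained nodes to service
```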

## Accessing the Cluster

Retrieve the external IP address of the login node using:
@@ -128,6 +135,7 @@ srun singularity exec docker://ghcr.io/stackhpc/mpitests-container:${MPI_CONTAIN
```

Note: The mpirun script assumes you are running as user 'rocky'. If you are running as root, you will need to include the --allow-run-as-root argument

## Reconfiguring the Cluster

### Changes to config files
@@ -171,3 +179,5 @@ and then restart the other dependent deployments to propagate changes:
```console
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```

# Known Issues
13 changes: 0 additions & 13 deletions generate-secrets.sh

This file was deleted.

8 changes: 8 additions & 0 deletions image/Dockerfile
@@ -9,12 +9,17 @@ LABEL org.opencontainers.image.source="https://github.com/stackhpc/slurm-docker-
ARG SLURM_TAG=slurm-23.02
ARG GOSU_VERSION=1.11

COPY kubernetes.repo /etc/yum.repos.d/kubernetes.repo

RUN set -ex \
&& yum makecache \
&& yum -y update \
&& yum -y install dnf-plugins-core epel-release \
&& yum -y install dnf-plugins-core \
&& yum config-manager --set-enabled powertools \
&& yum -y module enable ruby:2.7 nodejs:14 \
&& yum -y install https://yum.osc.edu/ondemand/2.0/ondemand-release-web-2.0-1.noarch.rpm \
&& yum -y module install ruby nodejs \
&& yum -y install \
wget \
bzip2 \
@@ -42,6 +47,8 @@ RUN set -ex \
hwloc-devel \
openssh-server \
apptainer \
ondemand \
kubectl \
&& yum clean all \
&& rm -rf /var/cache/yum

@@ -93,6 +100,7 @@ RUN mkdir /etc/sysconfig/slurm \

VOLUME /etc/slurm
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
COPY --chown=slurm:slurm --chmod=744 k8s-slurmd-* /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]

CMD ["slurmdbd"]
62 changes: 57 additions & 5 deletions image/docker-entrypoint.sh
@@ -48,7 +48,7 @@ then
done
echo "-- slurmdbd is now active ..."

echo "---> Setting permissions for state directory ..."
echo "---> Setting ownership for state directory ..."
chown slurm:slurm /var/spool/slurmctld

echo "---> Starting the Slurm Controller Daemon (slurmctld) ..."
@@ -86,6 +86,8 @@ then
chown root:root /home
chmod 755 /home

echo "---> Setting up ssh for user"

mkdir -p /home/rocky/.ssh
cp /tmp/authorized_keys /home/rocky/.ssh/authorized_keys

@@ -99,25 +101,75 @@ then
done
popd > /dev/null

echo "---> Complete"
echo "---> Starting sshd"
ssh-keygen -A
cp /tempmounts/etc/ssh/* /etc/ssh/
chmod 600 /etc/ssh/ssh_host_dsa_key
chmod 600 /etc/ssh/ssh_host_ecdsa_key
chmod 600 /etc/ssh/ssh_host_ed25519_key
chmod 600 /etc/ssh/ssh_host_rsa_key
/usr/sbin/sshd

start_munge --foreground
start_munge

echo "---> Setting up self ssh capabilities for OOD"

if [ -f /home/rocky/.ssh/id_rsa.pub ]; then
echo "ssh keys already found"
else
ssh-keygen -t rsa -f /home/rocky/.ssh/id_rsa -N ""
chown rocky:rocky /home/rocky/.ssh/id_rsa /home/rocky/.ssh/id_rsa.pub
fi

ssh-keyscan localhost > /etc/ssh/ssh_known_hosts
echo "" >> /home/rocky/.ssh/authorized_keys #Adding newline to avoid breaking authorized_keys file
cat /home/rocky/.ssh/id_rsa.pub >> /home/rocky/.ssh/authorized_keys

echo "---> Starting Apache Server"

# mkdir --parents /etc/ood/config/apps/shell
# env > /etc/ood/config/apps/shell/env

/usr/libexec/httpd-ssl-gencerts
/opt/ood/ood-portal-generator/sbin/update_ood_portal
mkdir --parents /opt/rh/httpd24/root/etc/httpd/

/usr/bin/htdbm -cb /opt/rh/httpd24/root/etc/httpd/.htpasswd.dbm rocky $ROCKY_OOD_PASS
/usr/sbin/httpd -k start -X -e debug

elif [ "$1" = "check-queue-hook" ]
then
start_munge

scontrol update NodeName=all State=DRAIN Reason="Preventing new jobs running before upgrade"

RUNNING_JOBS=$(squeue --states=RUNNING,COMPLETING,CONFIGURING,RESIZING,SIGNALING,STAGE_OUT,STOPPED,SUSPENDED --noheader --array | wc --lines)

if [[ $RUNNING_JOBS -eq 0 ]]
then
exit 0
exit 0
else
exit 1
exit 1
fi

elif [ "$1" = "undrain-nodes-hook" ]
then
start_munge
scontrol update NodeName=all State=UNDRAIN
exit 0

elif [ "$1" = "generate-keys-hook" ]
then
mkdir -p ./temphostkeys/etc/ssh
ssh-keygen -A -f ./temphostkeys
kubectl create secret generic host-keys-secret \
--dry-run=client \
--from-file=./temphostkeys/etc/ssh \
-o yaml | \
kubectl apply -f -

exit 0

elif [ "$1" = "debug" ]
then
start_munge --foreground
16 changes: 16 additions & 0 deletions image/k8s-slurmd-create
@@ -0,0 +1,16 @@
#!/usr/bin/bash

echo "$(date) Resume invoked $0 $*" &>> /var/log/slurm/power_save.log

APISERVER=https://kubernetes.default.svc
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
TOKEN=$(cat ${SERVICEACCOUNT}/token)
CACERT=${SERVICEACCOUNT}/ca.crt

hosts=$(scontrol show hostnames $1) # this is purely a textual expansion, doesn't depend on defined nodes
for host in $hosts
do
( sed s/SLURMD_NODENAME/$host/ /etc/slurm/slurmd-pod-template.yml | \
kubectl --server $APISERVER --token $TOKEN --certificate-authority $CACERT create -f - )
done
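The loop in `k8s-slurmd-create` is a plain template substitution. The following self-contained sketch reproduces that pattern with an illustrative template line and host list (neither is taken from the repo's actual `slurmd-pod-template.yml`):

```shell
#!/usr/bin/bash
# Substitute each hostname into a pod template, as k8s-slurmd-create does.
# In the real script the hosts come from: $(scontrol show hostnames $1)
template='metadata: {name: SLURMD_NODENAME}'  # illustrative template line
hosts="slurmd-0 slurmd-1"                     # illustrative expanded host list
out=""
for host in $hosts; do
  # one rendered manifest line per host
  out+="$(echo "$template" | sed "s/SLURMD_NODENAME/$host/")"$'\n'
done
printf '%s' "$out"
```

In the real script each rendered manifest is piped straight to `kubectl create -f -` instead of being collected into a variable.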
15 changes: 15 additions & 0 deletions image/k8s-slurmd-delete
@@ -0,0 +1,15 @@
#!/usr/bin/bash

echo "$(date) Suspend invoked $0 $*" >> /var/log/slurm/power_save.log

APISERVER=https://kubernetes.default.svc
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
TOKEN=$(cat ${SERVICEACCOUNT}/token)
CACERT=${SERVICEACCOUNT}/ca.crt

hosts=$(scontrol show hostnames $1) # this is purely a textual expansion, doesn't depend on defined nodes
for host in $hosts
do
kubectl --server $APISERVER --token $TOKEN --certificate-authority $CACERT delete pod $host
done
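These two scripts are shaped as Slurm power-saving hooks. A hedged `slurm.conf` sketch of how they would be wired up (paths match the Dockerfile's copy into `/usr/local/bin/`; the timeout values are illustrative, not taken from this PR):

```
# Hypothetical slurm.conf fragment for power-save-based autoscaling
ResumeProgram=/usr/local/bin/k8s-slurmd-create
SuspendProgram=/usr/local/bin/k8s-slurmd-delete
SuspendTime=300       # seconds idle before a node is powered down (illustrative)
ResumeTimeout=600     # seconds allowed for a slurmd pod to register (illustrative)
```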
6 changes: 6 additions & 0 deletions image/kubernetes.repo
@@ -0,0 +1,6 @@
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-$basearch
enabled=1
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
11 changes: 0 additions & 11 deletions nfs/deploy-nfs.sh

This file was deleted.

11 changes: 0 additions & 11 deletions nfs/pvc.yaml

This file was deleted.

16 changes: 0 additions & 16 deletions nfs/teardown-nfs.sh

This file was deleted.

9 changes: 7 additions & 2 deletions publish-keys.sh
@@ -1,3 +1,8 @@
kubectl create configmap authorized-keys-configmap \
NAMESPACE="$1"
if [[ -z $1 ]]; then
NAMESPACE=default
fi
echo Installing in namespace $NAMESPACE
kubectl -n $NAMESPACE create configmap authorized-keys-configmap \
"--from-literal=authorized_keys=$(cat ~/.ssh/*.pub)" --dry-run=client -o yaml | \
kubectl apply -f -
kubectl -n $NAMESPACE apply -f -