Switch to AL2023 ami for nvidia and inferentia nodes #988

Open
wants to merge 3 commits into
base: main
2 changes: 1 addition & 1 deletion charts/tfy-karpenter-config/Chart.yaml
@@ -2,4 +2,4 @@ apiVersion: v2
name: tfy-karpenter-config
description: "ArgoCD Applications for karpenter config"
type: application
- version: 0.1.36
+ version: 0.1.37-rc.1
12 changes: 6 additions & 6 deletions charts/tfy-karpenter-config/README.md
@@ -19,7 +19,7 @@ The script in `userData` does the following,
* [AL2023](./files/al2023-soci-default-provisioner-userdata.sh)

### `userData` script for `gpu-default` (GPU) EC2 Node Class
- * [AL2](./files/al2-soci-gpu-provisioner-userdata.sh)
+ * [AL2023](./files/al2023-soci-gpu-provisioner-userdata.sh)


### Debugging
@@ -69,7 +69,7 @@ https://github.com/awslabs/soci-snapshotter/blob/2e3df4a92415ff02ccc76ed9ceb1c25
| `karpenter.defaultNodeTemplate.instanceProfile` | Instance profile override for the node template | `""` |
| `karpenter.defaultNodeTemplate.rootVolumeSize` | Size for the root volume attached to node | `100Gi` |
| `karpenter.defaultNodeTemplate.extraTags` | Additional tags for the node template. | `{}` |
- | `karpenter.defaultNodeTemplate.amiFamily` | AMI family to use for node template | `""` |
+ | `karpenter.defaultNodeTemplate.amiFamily` | AMI family to use for node template | `AL2023` |
| `karpenter.defaultNodeTemplate.amiSelectorTerms` | AMI selector terms for the node template, conditions are ANDed | `[]` |
| `karpenter.defaultNodeTemplate.detailedMonitoring` | | `false` |
| `karpenter.defaultNodeTemplate.extraSubnetTags` | Additional tags for the subnet. | `{}` |
@@ -92,7 +92,7 @@ https://github.com/awslabs/soci-snapshotter/blob/2e3df4a92415ff02ccc76ed9ceb1c25
| `karpenter.gpuDefaultNodeTemplate.rootVolumeSize` | Size for the root volume attached to node | `100Gi` |
| `karpenter.gpuDefaultNodeTemplate.extraTags` | Additional tags for the gpu node template. | `{}` |
| `karpenter.gpuDefaultNodeTemplate.detailedMonitoring` | | `false` |
- | `karpenter.gpuDefaultNodeTemplate.amiFamily` | AMI family to use for node template | `""` |
+ | `karpenter.gpuDefaultNodeTemplate.amiFamily` | AMI family to use for node template | `AL2023` |
| `karpenter.gpuDefaultNodeTemplate.amiSelectorTerms` | AMI selector terms for the node template, conditions are ANDed | `[]` |
| `karpenter.gpuDefaultNodeTemplate.extraSubnetTags` | Additional tags for the subnet. | `{}` |
| `karpenter.gpuDefaultNodeTemplate.extraSecurityGroupTags` | Additional tags for the security group. | `{}` |
@@ -112,7 +112,7 @@ https://github.com/awslabs/soci-snapshotter/blob/2e3df4a92415ff02ccc76ed9ceb1c25
| `karpenter.controlPlaneNodeTemplate.instanceProfile` | Instance profile override for the node template | `""` |
| `karpenter.controlPlaneNodeTemplate.rootVolumeSize` | Size for the root volume attached to node | `100Gi` |
| `karpenter.controlPlaneNodeTemplate.extraTags` | Additional tags for the node template. | `{}` |
- | `karpenter.controlPlaneNodeTemplate.amiFamily` | AMI family to use for node template | `""` |
+ | `karpenter.controlPlaneNodeTemplate.amiFamily` | AMI family to use for node template | `AL2023` |
| `karpenter.controlPlaneNodeTemplate.amiSelectorTerms` | AMI selector terms for the node template, conditions are ANDed | `[]` |
| `karpenter.controlPlaneNodeTemplate.detailedMonitoring` | | `false` |
| `karpenter.controlPlaneNodeTemplate.extraSubnetTags` | Additional tags for the subnet. | `{}` |
@@ -136,7 +136,7 @@ https://github.com/awslabs/soci-snapshotter/blob/2e3df4a92415ff02ccc76ed9ceb1c25
| `karpenter.inferentiaDefaultNodeTemplate.rootVolumeSize` | Size for the root volume attached to node | `100Gi` |
| `karpenter.inferentiaDefaultNodeTemplate.extraTags` | Additional tags for the node template. | `{}` |
| `karpenter.inferentiaDefaultNodeTemplate.detailedMonitoring` | | `false` |
- | `karpenter.inferentiaDefaultNodeTemplate.amiFamily` | AMI family to use for node template | `""` |
+ | `karpenter.inferentiaDefaultNodeTemplate.amiFamily` | AMI family to use for node template | `AL2023` |
| `karpenter.inferentiaDefaultNodeTemplate.amiSelectorTerms` | AMI selector terms for the node template, conditions are ANDed | `[]` |
| `karpenter.inferentiaDefaultNodeTemplate.extraSubnetTags` | Additional tags for the subnet. | `{}` |
| `karpenter.inferentiaDefaultNodeTemplate.extraSecurityGroupTags` | Additional tags for the security group. | `{}` |
@@ -175,7 +175,7 @@ https://github.com/awslabs/soci-snapshotter/blob/2e3df4a92415ff02ccc76ed9ceb1c25
| `karpenter.critical.nodeclass.rootVolumeSize` | Size for the root volume attached to the node | `100Gi` |
| `karpenter.critical.nodeclass.extraTags` | Additional tags for the node template. | `{}` |
| `karpenter.critical.nodeclass.detailedMonitoring` | | `false` |
- | `karpenter.critical.nodeclass.amiFamily` | AMI family to use for node template | `""` |
+ | `karpenter.critical.nodeclass.amiFamily` | AMI family to use for node template | `AL2023` |
| `karpenter.critical.nodeclass.amiSelectorTerms` | AMI selector terms for the node template, conditions are ANDed | `[]` |
| `karpenter.critical.nodeclass.extraSubnetTags` | Additional tags for the subnet. | `{}` |
| `karpenter.critical.nodeclass.extraSecurityGroupTags` | Additional tags for the security group. | `{}` |

This file was deleted.

@@ -0,0 +1,134 @@
#!/bin/bash
set -ex

version_lte() {
printf '%s\n' "$1" "$2" | sort -C -V
}

version_lt() {
! version_lte "$2" "$1"
}

CONTAINERD_VERSION_OUTPUT="$(containerd --version)"
IFS=" " read -a CONTAINERD_VERSION_STRING <<< "${CONTAINERD_VERSION_OUTPUT}"
CONTAINERD_VERSION=${CONTAINERD_VERSION_STRING[2]}
SANDBOX_IMAGE=$(containerd config dump | grep -oE 'sandbox_image = "(.*)"' | sed 's/sandbox_image = //')
KUBELET_CONFIG_DIR="/etc/kubernetes/kubelet/config.json.d"
SOCI_KUBELET_CONFIG_PATH="$KUBELET_CONFIG_DIR/99-soci.conf"
CONTAINERD_CONFIG_FILEPATH="/etc/containerd/config.toml"
BACKUP_CONTAINERD_CONFIG_FILEPATH="$CONTAINERD_CONFIG_FILEPATH.bak"
SOCI_RELEASE_VERSION="0.7.0"
Review comment (Collaborator):

Should we try out version 0.8.0?

Reply (Contributor):

> Should we try out version 0.8.0?

We should pick this up separately. Let us release this first.
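
If the SOCI version is bumped in a follow-up, `SOCI_TAR_CHECKSUM` would need to change together with `SOCI_RELEASE_VERSION`, because the script aborts when `sha256sum -c` fails. A minimal sketch of that verification step in isolation follows. The helper name `verify_sha256` and the throwaway sample file are hypothetical and not part of this PR; the sketch assumes GNU coreutils `sha256sum`.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): verify a file against an expected SHA-256.
# Returns 0 when the digest matches, non-zero otherwise.
verify_sha256() {
  local expected="$1" file="$2"
  # sha256sum -c reads "<digest>  <filename>" lines; "-" means stdin.
  printf '%s  %s\n' "$expected" "$file" | sha256sum -c - >/dev/null 2>&1
}

# A throwaway file stands in for the release tarball.
sample="$(mktemp)"
printf 'demo payload\n' > "$sample"
expected="$(sha256sum "$sample" | cut -d' ' -f1)"

if verify_sha256 "$expected" "$sample"; then
  echo "checksum ok"
else
  echo "checksum mismatch" >&2
fi
```

For a real bump, the expected digest would come from running `sha256sum` on the downloaded release tarball after confirming its origin, not from the file under test as in this sketch.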

SOCI_TAR_CHECKSUM="8766cdd479272dcc86299e70a0f7a9343f940c98285c1491bb3c3cdc05b26f47"
SOCI_TAR_FILENAME="soci-snapshotter-$SOCI_RELEASE_VERSION-linux-amd64.tar.gz"
SOCI_CONFIG_DIR="/etc/soci-snapshotter-grpc/"
SOCI_CONFIG_FILEPATH="$SOCI_CONFIG_DIR/config.toml"
SOCI_SYSTEMD_SERVICE_FILEPATH="/etc/systemd/system/soci-snapshotter.service"
cp $CONTAINERD_CONFIG_FILEPATH $BACKUP_CONTAINERD_CONFIG_FILEPATH

setup_soci() {( set -ex

yum install fuse -y
modprobe fuse

curl \
--silent \
--show-error \
--retry 3 \
--retry-delay 1 \
-L https://github.com/awslabs/soci-snapshotter/releases/download/v${SOCI_RELEASE_VERSION}/soci-snapshotter-${SOCI_RELEASE_VERSION}-linux-amd64.tar.gz \
-o $SOCI_TAR_FILENAME
echo "$SOCI_TAR_CHECKSUM $SOCI_TAR_FILENAME" | sha256sum -c

tar -C /usr/local/bin -xvf $SOCI_TAR_FILENAME soci soci-snapshotter-grpc
rm $SOCI_TAR_FILENAME

cat > $SOCI_SYSTEMD_SERVICE_FILEPATH << EOF
[Unit]
Description=soci snapshotter containerd plugin
Documentation=https://github.com/awslabs/soci-snapshotter
After=network.target
Before=containerd.service

[Service]
Type=notify
ExecStart=/usr/local/bin/soci-snapshotter-grpc
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

mkdir -p $SOCI_CONFIG_DIR

cat > $SOCI_CONFIG_FILEPATH << EOF
[cri_keychain]
enable_keychain=true
image_service_path="/run/containerd/containerd.sock"
EOF

systemctl daemon-reload
systemctl enable --now soci-snapshotter
systemctl status soci-snapshotter

echo patching $CONTAINERD_CONFIG_FILEPATH
cat > $CONTAINERD_CONFIG_FILEPATH << EOF
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
discard_unpacked_layers = true
## ENABLE SOCI SNAPSHOTTER
snapshotter = "soci"
disable_snapshot_annotations = false
##

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = $SANDBOX_IMAGE

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

## SOCI PLUGIN
[proxy_plugins]
[proxy_plugins.soci]
type = "snapshot"
address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
[proxy_plugins.soci.exports]
root = "/var/lib/soci-snapshotter-grpc"
##
EOF

if [ -d "$KUBELET_CONFIG_DIR" ]; then
cat > $SOCI_KUBELET_CONFIG_PATH << EOF
{"apiVersion":"kubelet.config.k8s.io/v1beta1","kind":"KubeletConfiguration","imageServiceEndpoint":"unix:///run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"}
EOF
else
echo "Kubelet config dir not found at $KUBELET_CONFIG_DIR. SOCI will not work for private images."
fi

)}

# containerd <1.7.16 does not enforce storage limits
if [ -n "$CONTAINERD_VERSION" ] && version_lte "1.5.0" "$CONTAINERD_VERSION" && version_lt "$CONTAINERD_VERSION" "2.0" && [ -n "$SANDBOX_IMAGE" ]; then
setup_soci
else
echo "CONTAINERD_VERSION is empty or not within the specified range or we could not find the sandbox image."
fi
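
The gate at the end of the script only calls `setup_soci` when the containerd version is at least 1.5.0 and below 2.0. A small sketch of how the two helpers evaluate is shown here, using a made-up sample string instead of a live `containerd --version` call. `in_supported_range` is a hypothetical wrapper that is not in the PR, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/bash
# Helpers copied from the script above.
version_lte() {
  printf '%s\n' "$1" "$2" | sort -C -V
}
version_lt() {
  ! version_lte "$2" "$1"
}

# Hypothetical wrapper (not in the PR): true when 1.5.0 <= version < 2.0.
in_supported_range() {
  version_lte "1.5.0" "$1" && version_lt "$1" "2.0"
}

# Made-up sample output; the third space-separated field is the version.
sample_output="containerd github.com/containerd/containerd 1.7.11 0000000"
IFS=" " read -r -a fields <<< "$sample_output"

for v in "${fields[2]}" "1.4.13" "2.0.0"; do
  if in_supported_range "$v"; then
    echo "$v: supported"
  else
    echo "$v: skipped"
  fi
done
```

Both bounds are compared as version strings rather than plain text, so `1.10.0` would sort after `1.9.0` as intended.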
@@ -53,9 +53,7 @@ spec:
deleteOnTermination: true
{{- if not (eq .Values.karpenter.gpuDefaultNodeTemplate.userData "") }}
{{- if eq .Values.karpenter.gpuDefaultNodeTemplate.userData "default" }}
- {{- if eq .Values.karpenter.gpuDefaultNodeTemplate.amiFamily "AL2" }}
- userData: {{ .Files.Get "files/al2-gpu-provisioner-userdata.sh" | toYaml | indent 2 }}
- {{- else if eq .Values.karpenter.gpuDefaultNodeTemplate.amiFamily "Bottlerocket" }}
+ {{- if eq .Values.karpenter.gpuDefaultNodeTemplate.amiFamily "Bottlerocket" }}
userData: {{ .Files.Get "files/bottlerocket-gpu-provisioner-userdata.toml" | toYaml | indent 2 }}
{{- end }}
{{- else }}
@@ -1,3 +1,4 @@
+ {{- if .Values.karpenter.inferentiaDefaultNodeTemplate.enabled }}
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
@@ -51,4 +52,5 @@ spec:
deleteOnTermination: true
{{- if not (eq .Values.karpenter.inferentiaDefaultNodeTemplate.userData "")}}
userData: {{- .Values.karpenter.inferentiaDefaultNodeTemplate.userData | toYaml | indent 2 }}
- {{- end }}
+ {{- end }}
+ {{- end }}
14 changes: 7 additions & 7 deletions charts/tfy-karpenter-config/values.yaml
@@ -23,7 +23,7 @@ karpenter:
## @param karpenter.defaultNodeTemplate.extraTags [object] Additional tags for the node template.
extraTags: {}
## @param karpenter.defaultNodeTemplate.amiFamily AMI family to use for node template
- amiFamily: ""
+ amiFamily: "AL2023"

Review comment (Collaborator):

Please don't apply this. When Karpenter was migrated to v1, the webhook migrated itself to use `amiSelectorTerms` instead of `amiFamily`, so applying this will cause drift to be detected in the node template and disrupt all nodes.

## @param karpenter.defaultNodeTemplate.amiSelectorTerms [array] AMI selector terms for the node template, conditions are ANDed
# amiSelectorTerms:
# - tags:
@@ -97,7 +97,7 @@ karpenter:
# Set this to true to enable EC2 detailed cloudwatch monitoring
detailedMonitoring: false
## @param karpenter.gpuDefaultNodeTemplate.amiFamily AMI family to use for node template
- amiFamily: ""
+ amiFamily: "AL2023"

Review comment (Collaborator):

Please don't apply this.

## @param karpenter.gpuDefaultNodeTemplate.amiSelectorTerms [array] AMI selector terms for the node template, conditions are ANDed
# amiSelectorTerms:
# - tags:
@@ -106,7 +106,7 @@ karpenter:
# - name: my-ami
# - id: ami-123
amiSelectorTerms:
- - alias: al2@latest
+ - alias: al2023@latest

Review comment (Collaborator):

We need to release this carefully. This will disrupt all the GPU nodes.

## @skip karpenter.gpuDefaultNodeTemplate.userData
# Set this to "default" to let the chart automatically decide this
# Set this to "" to disable this injection
@@ -183,7 +183,7 @@ karpenter:
## @param karpenter.controlPlaneNodeTemplate.extraTags [object] Additional tags for the node template.
extraTags: {}
## @param karpenter.controlPlaneNodeTemplate.amiFamily AMI family to use for node template
- amiFamily: ""
+ amiFamily: "AL2023"

Review comment (Collaborator):

Please don't apply this.

## @param karpenter.controlPlaneNodeTemplate.amiSelectorTerms [array] AMI selector terms for the node template, conditions are ANDed
# amiSelectorTerms:
# - tags:
@@ -258,7 +258,7 @@ karpenter:
# Set this to true to enable EC2 detailed cloudwatch monitoring
detailedMonitoring: false
## @param karpenter.inferentiaDefaultNodeTemplate.amiFamily AMI family to use for node template
- amiFamily: ""
+ amiFamily: "AL2023"
## @param karpenter.inferentiaDefaultNodeTemplate.amiSelectorTerms [array] AMI selector terms for the node template, conditions are ANDed
# amiSelectorTerms:
# - tags:
@@ -267,7 +267,7 @@ karpenter:
# - name: my-ami
# - id: ami-123
amiSelectorTerms:
- - alias: al2@latest
+ - alias: al2023@latest
## @skip karpenter.inferentiaDefaultNodeTemplate.userData
# Set this to "" to disable this injection
userData: ""
@@ -366,7 +366,7 @@ karpenter:
# Set this to true to enable EC2 detailed cloudwatch monitoring
detailedMonitoring: false
## @param karpenter.critical.nodeclass.amiFamily AMI family to use for node template
- amiFamily: ""
+ amiFamily: "AL2023"

Review comment (Collaborator):

Please don't apply this.

## @param karpenter.critical.nodeclass.amiSelectorTerms [array] AMI selector terms for the node template, conditions are ANDed
# amiSelectorTerms:
# - tags: