
Talos Support #379

Open
evilhamsterman opened this issue Feb 23, 2024 · 25 comments

@evilhamsterman

Talos is becoming more popular, but currently the csi-driver doesn't work with it. If we need to do manual configuration of things like iSCSI and multipath, we can do that by pushing values/files through the machine config. But the biggest hitch to me appears to be the requirement to create and mount /etc/hpe-storage on the host. That works on CoreOS but not on Talos, because essentially the whole system is read-only.

From what I can see that mount is needed to store a unique ID for the node. Couldn't you use the already existing unique node ID and store node-specific data in ConfigMaps?
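To illustrate the idea (everything below is made up for illustration, not an existing resource), the per-node data could live in something like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpe-csi-node-identity   # hypothetical name
  namespace: hpe-storage
data:
  # keyed by the Kubernetes node name; the value stands in for whatever the
  # driver currently keeps under /etc/hpe-storage on each node
  worker-1: "<per-node driver id>"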

@datamattsson
Collaborator

Hey, thanks for the interest! We've been kicking this around for a bit and I filed an internal JIRA to move the identifier to the Kubernetes control-plane instead. I've had some heated conversations with Andrew from the Talos project and I'm not 100% sure moving the identifier to Kubernetes will solve all our problems.

If you are an existing HPE customer or a prospect, you should work with your account team and mention this requirement. That is the fastest route.

@evilhamsterman
Author

I don't think moving the ID to the control plane would solve all the problems, but it's a start. Maybe at least make it possible to set the /etc/hpe-storage mount path so we can point it at Talos' ephemeral storage? It's possible with Kustomize (roughly like the sketch below), but that's an extra step. I do plan on talking with our account rep but wanted to get it on the board here.
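A sketch of the Kustomize workaround, assuming the chart is rendered to a file first; the volume name comes from the chart's hpe-csi-node DaemonSet, and the target path is only an example of a writable location on Talos:

# kustomization.yaml (sketch)
resources:
  - hpe-csi-driver.yaml          # rendered chart output
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: hpe-csi-node
        namespace: hpe-storage
      spec:
        template:
          spec:
            volumes:
              - name: etc-hpe-storage-dir
                hostPath:
                  path: /var/lib/hpe-storage   # example writable path; adjust
                  type: DirectoryOrCreate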

@datamattsson
Collaborator

Internal JIRA is CON-1838.

@mikart143

Hi, is there any news about support for Talos?

@datamattsson
Collaborator

It did not make it into the 2.5.0 release. I was going to do some research on it but it got delayed.

@evilhamsterman
Author

I'm glad to hear that it is actively being pursued at least. I will likely be deploying a new cluster in the relatively near future and it would be nice to be able to start with Talos.

@evilhamsterman
Author

I try not to be the one pinging for updates all the time. But I need to start deploying a bare metal Kubernetes cluster soon and I'm in a bit of a planning pickle. I'd really like to just start with Talos but can't because of the need to use Nimble for PVs. I can start with a kubeadm cluster and later migrate to Talos, but that would mean putting a bunch of effort into setting up deployment workflows that may just be abandoned shortly after. So I'm not sure how much effort I should invest in automation vs just rolling by hand for now, or using an alternative storage.

I understand 2.5 is out of the picture; it looks like there are already betas for that. So is this planned to be included in 2.6, which based on the previous release cadence we may see before EOY, or perhaps a 2.5.x release? Or is it planned for a longer timeframe, like next year? Just trying to get an idea to help with planning.

@datamattsson
Collaborator

It's hard for me to gauge when we can get to a stage to support Talos and immutable nodes in general. It's very high on my list but I rarely get my way when large deals are on the table demanding feature X, Y and Z.

Also, full disclosure, we have not even scoped the next minor or patch release as we're neck-deep in stabilizing 2.5.0. I'll make a note and try to get it in for consideration in the next couple of releases.

If you want to email me directly at michael.mattsson at hpe.com with your company name and business relationship with HPE it will make it easier for me to talk to product management.

@datamattsson
Collaborator

I don't have a Talos environment readily available, and skimming through the docs I realize I either need firewall rules or have to deploy a new deployment environment for Talos itself.

As a quick hack, can you tell me how far you get with this?

helm repo add datamattsson https://datamattsson.github.io/co-deployments/
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos --set disableNodeConfiguration=true

@evilhamsterman
Author

evilhamsterman commented Jun 21, 2024

It looks like it is still mounting /etc/hpe-storage and causing failures due to the RO filesystem

Node YAML
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-06-21T18:34:59Z"
  generateName: hpe-csi-node-
  labels:
    app: hpe-csi-node
    controller-revision-hash: 6cc9c89c6b
    pod-template-generation: "1"
    role: hpe-csi
  name: hpe-csi-node-tsvkz
  namespace: hpe-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: hpe-csi-node
    uid: 280184d8-2211-44a8-9829-4d182242cb65
  resourceVersion: "7099"
  uid: 29d72260-ce51-4e52-8050-f975d54eacbc
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - talos-nvj-4af
  containers:
  - args:
    - --csi-address=$(ADDRESS)
    - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    - --v=5
    env:
    - name: ADDRESS
      value: /csi/csi.sock
    - name: DRIVER_REG_SOCK_PATH
      value: /var/lib/kubelet/plugins/csi.hpe.com/csi.sock
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.1
    imagePullPolicy: IfNotPresent
    name: csi-node-driver-registrar
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /registration
      name: registration-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-lw749
      readOnly: true
  - args:
    - --endpoint=$(CSI_ENDPOINT)
    - --node-service
    - --flavor=kubernetes
    - --node-monitor
    - --node-monitor-interval=30
    env:
    - name: CSI_ENDPOINT
      value: unix:///csi/csi.sock
    - name: LOG_LEVEL
      value: info
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: DISABLE_NODE_CONFIGURATION
      value: "true"
    - name: KUBELET_ROOT_DIR
      value: /var/lib/kubelet
    image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imagePullPolicy: IfNotPresent
    name: hpe-csi-driver
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - SYS_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/lib/kubelet
      mountPropagation: Bidirectional
      name: pods-mount-dir
    - mountPath: /host
      mountPropagation: Bidirectional
      name: root-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /var/log
      name: log-dir
    - mountPath: /etc/hpe-storage
      name: etc-hpe-storage-dir
    - mountPath: /etc/kubernetes
      name: etc-kubernetes
    - mountPath: /sys
      name: sys
    - mountPath: /run/systemd
      name: runsystemd
    - mountPath: /etc/systemd/system
      name: etcsystemd
    - mountPath: /opt/hpe-storage/nimbletune/config.json
      name: linux-config-file
      subPath: config.json
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-lw749
      readOnly: true
  dnsConfig:
    options:
    - name: ndots
      value: "1"
  dnsPolicy: ClusterFirstWithHostNet
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - args:
    - --node-init
    - --endpoint=$(CSI_ENDPOINT)
    - --flavor=kubernetes
    env:
    - name: CSI_ENDPOINT
      value: unix:///csi/csi.sock
    image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imagePullPolicy: IfNotPresent
    name: hpe-csi-node-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - SYS_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: root-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /sys
      name: sys
    - mountPath: /etc/hpe-storage
      name: etc-hpe-storage-dir
    - mountPath: /run/systemd
      name: runsystemd
    - mountPath: /etc/systemd/system
      name: etcsystemd
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/lib/kubelet
      name: pods-mount-dir
    - mountPath: /var/log
      name: log-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-lw749
      readOnly: true
  nodeName: talos-nvj-4af
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: hpe-csi-node-sa
  serviceAccountName: hpe-csi-node-sa
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: csi.hpe.com/hpe-nfs
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/plugins_registry
      type: Directory
    name: registration-dir
  - hostPath:
      path: /var/lib/kubelet/plugins/csi.hpe.com
      type: DirectoryOrCreate
    name: plugin-dir
  - hostPath:
      path: /var/lib/kubelet
      type: ""
    name: pods-mount-dir
  - hostPath:
      path: /
      type: ""
    name: root-dir
  - hostPath:
      path: /dev
      type: ""
    name: device-dir
  - hostPath:
      path: /var/log
      type: ""
    name: log-dir
  - hostPath:
      path: /etc/hpe-storage
      type: ""
    name: etc-hpe-storage-dir
  - hostPath:
      path: /etc/kubernetes
      type: ""
    name: etc-kubernetes
  - hostPath:
      path: /run/systemd
      type: ""
    name: runsystemd
  - hostPath:
      path: /etc/systemd/system
      type: ""
    name: etcsystemd
  - hostPath:
      path: /sys
      type: ""
    name: sys
  - configMap:
      defaultMode: 420
      name: hpe-linux-config
    name: linux-config-file
  - name: kube-api-access-lw749
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:35:00Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:34:59Z"
    message: 'containers with incomplete status: [hpe-csi-node-init]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:34:59Z"
    message: 'containers with unready status: [csi-node-driver-registrar hpe-csi-driver]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:34:59Z"
    message: 'containers with unready status: [csi-node-driver-registrar hpe-csi-driver]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:34:59Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.1
    imageID: ""
    lastState: {}
    name: csi-node-driver-registrar
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  - image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imageID: ""
    lastState: {}
    name: hpe-csi-driver
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 10.100.155.236
  hostIPs:
  - ip: 10.100.155.236
  initContainerStatuses:
  - image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imageID: ""
    lastState: {}
    name: hpe-csi-node-init
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to generate container "d1bfa53cdae544c0b62c5d36c001fc2f7270357ac5bcf01691257ae999dbc058"
          spec: failed to generate spec: failed to mkdir "/etc/hpe-storage": mkdir
          /etc/hpe-storage: read-only file system'
        reason: CreateContainerError
  phase: Pending
  podIP: 10.100.155.236
  podIPs:
  - ip: 10.100.155.236
  qosClass: Burstable
  startTime: "2024-06-21T18:34:59Z"
Controller YAML
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-06-21T18:42:38Z"
  generateName: hpe-csi-controller-574bc6ccf9-
  labels:
    app: hpe-csi-controller
    pod-template-hash: 574bc6ccf9
    role: hpe-csi
  name: hpe-csi-controller-574bc6ccf9-bzpb5
  namespace: hpe-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hpe-csi-controller-574bc6ccf9
    uid: 5b8c38be-808a-4293-b80a-b7780843bc8b
  resourceVersion: "7487"
  uid: 727b54dc-e0b6-4281-b6a4-4cbd297a592f
spec:
  containers:
  - args:
    - --csi-address=$(ADDRESS)
    - --v=5
    - --extra-create-metadata
    - --timeout=30s
    - --worker-threads=16
    - --feature-gates=Topology=true
    - --immediate-topology=false
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
    image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.1
    imagePullPolicy: IfNotPresent
    name: csi-provisioner
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --csi-address=$(ADDRESS)
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
    image: registry.k8s.io/sig-storage/csi-attacher:v4.5.1
    imagePullPolicy: IfNotPresent
    name: csi-attacher
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --csi-address=$(ADDRESS)
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
    image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.2
    imagePullPolicy: IfNotPresent
    name: csi-snapshotter
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy/
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --csi-address=$(ADDRESS)
    - --v=5
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi.sock
    image: registry.k8s.io/sig-storage/csi-resizer:v1.10.1
    imagePullPolicy: IfNotPresent
    name: csi-resizer
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --endpoint=$(CSI_ENDPOINT)
    - --flavor=kubernetes
    - --pod-monitor
    - --pod-monitor-interval=30
    env:
    - name: CSI_ENDPOINT
      value: unix:///var/lib/csi/sockets/pluginproxy/csi.sock
    - name: LOG_LEVEL
      value: info
    image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imagePullPolicy: IfNotPresent
    name: hpe-csi-driver
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy
      name: socket-dir
    - mountPath: /var/log
      name: log-dir
    - mountPath: /etc/kubernetes
      name: k8s
    - mountPath: /etc/hpe-storage
      name: hpeconfig
    - mountPath: /host
      name: root-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --csi-address=$(ADDRESS)
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock
    image: quay.io/hpestorage/volume-mutator:v1.3.6-beta
    imagePullPolicy: IfNotPresent
    name: csi-volume-mutator
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy/
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --csi-address=$(ADDRESS)
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock
    image: quay.io/hpestorage/volume-group-snapshotter:v1.0.6-beta
    imagePullPolicy: IfNotPresent
    name: csi-volume-group-snapshotter
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy/
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --csi-address=$(ADDRESS)
    env:
    - name: ADDRESS
      value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock
    image: quay.io/hpestorage/volume-group-provisioner:v1.0.6-beta
    imagePullPolicy: IfNotPresent
    name: csi-volume-group-provisioner
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy/
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  - args:
    - --v=5
    - --endpoint=$(CSI_ENDPOINT)
    env:
    - name: CSI_ENDPOINT
      value: unix:///var/lib/csi/sockets/pluginproxy/csi-extensions.sock
    - name: LOG_LEVEL
      value: info
    image: quay.io/hpestorage/csi-extensions:v1.2.7-beta
    imagePullPolicy: IfNotPresent
    name: csi-extensions
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/csi/sockets/pluginproxy/
      name: socket-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-djh9g
      readOnly: true
  dnsConfig:
    options:
    - name: ndots
      value: "1"
  dnsPolicy: ClusterFirstWithHostNet
  enableServiceLinks: true
  hostNetwork: true
  nodeName: talos-nvj-4af
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: hpe-csi-controller-sa
  serviceAccountName: hpe-csi-controller-sa
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: socket-dir
  - hostPath:
      path: /var/log
      type: ""
    name: log-dir
  - hostPath:
      path: /etc/kubernetes
      type: ""
    name: k8s
  - hostPath:
      path: /etc/hpe-storage
      type: ""
    name: hpeconfig
  - hostPath:
      path: /
      type: ""
    name: root-dir
  - name: kube-api-access-djh9g
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:42:41Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:42:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:42:39Z"
    message: 'containers with unready status: [hpe-csi-driver]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:42:39Z"
    message: 'containers with unready status: [hpe-csi-driver]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-06-21T18:42:39Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://69d077a9414b9f622fffc550c68bf651c4ede0fc41ef85279347c363049f4f54
    image: registry.k8s.io/sig-storage/csi-attacher:v4.5.1
    imageID: registry.k8s.io/sig-storage/csi-attacher@sha256:9dcd469f02bbb7592ad61b0f848ec242f9ea2102187a0cd8407df33c2d633e9c
    lastState:
      terminated:
        containerID: containerd://dd726dccda8c6a3774e1e96060d9b1529dfebbed83667ea76e6fd85c0b995b0b
        exitCode: 1
        finishedAt: "2024-06-21T18:43:41Z"
        reason: Error
        startedAt: "2024-06-21T18:43:10Z"
    name: csi-attacher
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:43:56Z"
  - containerID: containerd://ab469a79652fd7d894e15f93528d2d92a03fa867c80f818c2575eee3ce530652
    image: quay.io/hpestorage/csi-extensions:v1.2.7-beta
    imageID: quay.io/hpestorage/csi-extensions@sha256:106637da1dad32a0ffda17f3110f5d396cc6b03ed2af63b4c5260c8ed02b1314
    lastState: {}
    name: csi-extensions
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:42:40Z"
  - containerID: containerd://b0a981a751c5143813a4db0b53bb2c2243312136e543a4d65b952dc61b84f5c1
    image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.1
    imageID: registry.k8s.io/sig-storage/csi-provisioner@sha256:bf5a235b67d8aea00f5b8ec24d384a2480e1017d5458d8a63b361e9eeb1608a9
    lastState:
      terminated:
        containerID: containerd://443bb47ec3cb0e9093342377d75f7812422e7f62bf4d0ce5d22757c42052dc15
        exitCode: 1
        finishedAt: "2024-06-21T18:43:40Z"
        reason: Error
        startedAt: "2024-06-21T18:43:10Z"
    name: csi-provisioner
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:43:56Z"
  - containerID: containerd://bbaafc9a87726f13ca40e7b7f1973e4473dd8ce94a78c9a67ce05b7205e88553
    image: registry.k8s.io/sig-storage/csi-resizer:v1.10.1
    imageID: registry.k8s.io/sig-storage/csi-resizer@sha256:4ecda2818f6d88a8f217babd459fdac31588f85581aa95ac7092bb0471ff8541
    lastState:
      terminated:
        containerID: containerd://9d299091598a0a53213d7a92321f4ef5fc9fff1d1f88beba87b62fe35c7b7639
        exitCode: 1
        finishedAt: "2024-06-21T18:43:41Z"
        reason: Error
        startedAt: "2024-06-21T18:43:11Z"
    name: csi-resizer
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:43:56Z"
  - containerID: containerd://fe753c0bc4d762a861c59f5d557c4152e6bf85bb5495fb336e3e8a8ce57bf5e4
    image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.2
    imageID: registry.k8s.io/sig-storage/csi-snapshotter@sha256:c4b6b02737bc24906fcce57fe6626d1a36cb2b91baa971af2a5e5a919093c34e
    lastState:
      terminated:
        containerID: containerd://ec7b4e064f648cfd70c882b81a601db820d1eaf483f30867bcaaf93347d26879
        exitCode: 1
        finishedAt: "2024-06-21T18:43:41Z"
        reason: Error
        startedAt: "2024-06-21T18:43:11Z"
    name: csi-snapshotter
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:43:56Z"
  - containerID: containerd://1eb157a1a0fe1ebf3bc26ac8a6d7ee1a729fc1e1f7b04edc78664ea1294ceff0
    image: quay.io/hpestorage/volume-group-provisioner:v1.0.6-beta
    imageID: quay.io/hpestorage/volume-group-provisioner@sha256:8d1ee0f752271148c019bc6ff2db53fdbfb56dfce3ede2e8f1549952becfeb05
    lastState: {}
    name: csi-volume-group-provisioner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:42:40Z"
  - containerID: containerd://42fec0266f3669000d461c690cc2c0fd74e7d8a5c0f0093a5b591c82fc3b6612
    image: quay.io/hpestorage/volume-group-snapshotter:v1.0.6-beta
    imageID: quay.io/hpestorage/volume-group-snapshotter@sha256:9be38de0f93f6b4ce7d0456eaabf5da3890b094a89a7b811852d31fbaf76c79c
    lastState: {}
    name: csi-volume-group-snapshotter
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:42:40Z"
  - containerID: containerd://ae0dce20062d444aa8a124fe753bcc200c1b8008a3a4ef800e7b4500fc73b861
    image: quay.io/hpestorage/volume-mutator:v1.3.6-beta
    imageID: quay.io/hpestorage/volume-mutator@sha256:247153bb789805c272b76fd8018ccd0f8bf4eabded5d4baf362d8a2c162b8672
    lastState: {}
    name: csi-volume-mutator
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-06-21T18:42:40Z"
  - image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imageID: ""
    lastState: {}
    name: hpe-csi-driver
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to generate container "ee110797b0f68f31aa64c448b04f663590359bc4181a08be4f764f4dd599941f"
          spec: failed to generate spec: failed to mkdir "/etc/hpe-storage": mkdir
          /etc/hpe-storage: read-only file system'
        reason: CreateContainerError
  hostIP: 10.100.155.236
  hostIPs:
  - ip: 10.100.155.236
  phase: Pending
  podIP: 10.100.155.236
  podIPs:
  - ip: 10.100.155.236
  qosClass: Burstable
  startTime: "2024-06-21T18:42:39Z"

@datamattsson
Collaborator

Ok, I had a brain fart, try now.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos2 --set disableNodeConfiguration=true

@evilhamsterman
Author

Getting closer: the controller started fine, but the hpe-csi-node DaemonSet pod is still trying to mount /etc/systemd/system

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m2s                default-scheduler  Successfully assigned hpe-storage/hpe-csi-node-qv9xk to talos-nvj-4af
  Warning  Failed     2m2s                kubelet            Error: failed to generate container "28b5218a6cea8f05806ec4210312762aa45cc1a851befe51d3e231bb6ff95fa2" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     2m2s                kubelet            Error: failed to generate container "5f4e20edc65d2a0990d99c0b5da15cf61f3c0273d577f1bacacbbcc49bf77ff5" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     108s                kubelet            Error: failed to generate container "1a59382b30bcc28fca08f6b48cf9ccce5adee2d003634ab00a59c9d470ad0a3c" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     97s                 kubelet            Error: failed to generate container "bdcdb9ac2dac778320a6f1fccfa7e0198ceb9f62cce3ab03ca59b7f061442133" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     85s                 kubelet            Error: failed to generate container "97701cc024c101137235529d83b03f1461e1dd97e48c543ac5d72474362e739d" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     74s                 kubelet            Error: failed to generate container "3176d754668a42fc845d93ef4ca8b116bd59f67ec35983626e9901f70099b219" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     61s                 kubelet            Error: failed to generate container "e0d1cee086f4f574cf0e9eee92da6ba94dbaa359990e92068ec6926dd8e16d03" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     46s                 kubelet            Error: failed to generate container "f6ab37bf1edc712984ff69f9f5da848a5eb6e4cf1bec0efa8cc697cc4f776e8b" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     34s                 kubelet            Error: failed to generate container "c641f39c98980fccbac986b9c4bf7d35b2b226fc70fc12e71c54dc50b672bd77" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Normal   Pulled     7s (x11 over 2m2s)  kubelet            Container image "quay.io/hpestorage/csi-driver:v2.5.0-beta" already present on machine
  Warning  Failed     7s (x2 over 21s)    kubelet            (combined from similar events): Error: failed to generate container "6d45577bdd7ca1971a3eba9b3c110ea41001ed5d08cdfb91792fac458da31a37" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system

@evilhamsterman
Author

I did ensure disableNodeConfiguration is set

❯ helm get values my-hpe-csi-driver
USER-SUPPLIED VALUES:
disableNodeConfiguration: true

@datamattsson
Collaborator

Ok, here's the next one: 2.5.0-talos3.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos3 --set disableNodeConfiguration=true

@evilhamsterman
Author

The pod starts but the initContainer immediately crashes

hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-init ']'
hpe-csi-node-init + for arg in "$@"
hpe-csi-node-init + '[' --flavor=kubernetes = --node-service ']'
hpe-csi-node-init + '[' --flavor=kubernetes = --node-init ']'
hpe-csi-node-init + disableNodeConformance=
hpe-csi-node-init + disableNodeConfiguration=
hpe-csi-node-init + '[' true = true ']'
hpe-csi-node-init + '[' '' = true ']'
hpe-csi-node-init + '[' '' = true ']'
hpe-csi-node-init + '[' '' '!=' true ']'
hpe-csi-node-init + cp -f /opt/hpe-storage/lib/hpe-storage-node.service /etc/systemd/system/hpe-storage-node.service
hpe-csi-node-init + cp -f /opt/hpe-storage/lib/hpe-storage-node.sh /etc/hpe-storage/hpe-storage-node.sh
hpe-csi-node-init cp: cannot create regular file '/etc/hpe-storage/hpe-storage-node.sh': No such file or directory

@evilhamsterman
Author

evilhamsterman commented Jun 21, 2024

It looks like the DISABLE_NODE_CONFIGURATION environment variable is not getting set on the initContainer

spec:
  initContainers:
  - args:
    - --node-init
    - --endpoint=$(CSI_ENDPOINT)
    - --flavor=kubernetes
    env:
    - name: CSI_ENDPOINT
      value: unix:///csi/csi.sock
    image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imagePullPolicy: IfNotPresent
    name: hpe-csi-node-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - SYS_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: root-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /sys
      name: sys
    - mountPath: /run/systemd
      name: runsystemd
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/lib/kubelet
      name: pods-mount-dir
    - mountPath: /var/log
      name: log-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xsr7w
      readOnly: true

@datamattsson
Collaborator

This is very interesting, I think you just uncovered a different bug altogether. =)

@datamattsson
Collaborator

datamattsson commented Jun 21, 2024

Ok, talos4 has been published.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos4 --set disableNodeConfiguration=true

@evilhamsterman
Author

evilhamsterman commented Jun 21, 2024

I edited the DaemonSet to add the environment variable and used your latest update. The initContainer succeeds now, but then I think we get to the meat of the situation: the csi-node-driver-registrar starts crashing and the hpe-csi-driver container complains it can't find initiators. It looks like part of the problem is on the Talos side; their iscsi-tools extension doesn't appear to include the multipath command (siderolabs/extensions#134). Though democratic-csi claims it's not needed, I'm not an expert in iSCSI so I can't say how true that is (democratic-csi/democratic-csi#225 (comment)).
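The edit amounts to something along these lines (a hypothetical one-liner; the initContainer index is taken from the DaemonSet spec above):

kubectl -n hpe-storage patch daemonset hpe-csi-node --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/initContainers/0/env/-","value":{"name":"DISABLE_NODE_CONFIGURATION","value":"true"}}]'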

Container logs
hpe-csi-driver + '[' --endpoint=unix:///csi/csi.sock = --node-service ']'
hpe-csi-driver + '[' --endpoint=unix:///csi/csi.sock = --node-init ']'
csi-node-driver-registrar I0621 21:31:10.477623       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
csi-node-driver-registrar I0621 21:31:10.477723       1 connection.go:215] Connecting to unix:///csi/csi.sock
csi-node-driver-registrar I0621 21:31:10.478469       1 main.go:164] Calling CSI driver to discover driver name
hpe-csi-driver + for arg in "$@"
hpe-csi-driver + '[' --node-service = --node-service ']'
hpe-csi-driver + nodeService=true
hpe-csi-driver + '[' --node-service = --node-init ']'
hpe-csi-driver + for arg in "$@"
hpe-csi-node-init + for arg in "$@"
hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-service ']'
hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-init ']'
hpe-csi-node-init + for arg in "$@"
hpe-csi-node-init + '[' --flavor=kubernetes = --node-service ']'
hpe-csi-node-init + '[' --flavor=kubernetes = --node-init ']'
hpe-csi-node-init + disableNodeConformance=
hpe-csi-node-init + disableNodeConfiguration=true
hpe-csi-node-init + '[' true = true ']'
hpe-csi-node-init + '[' '' = true ']'
csi-node-driver-registrar I0621 21:31:10.478557       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
csi-node-driver-registrar I0621 21:31:10.478586       1 connection.go:245] GRPC request: {}
csi-node-driver-registrar I0621 21:31:10.481052       1 connection.go:251] GRPC response: {"name":"csi.hpe.com","vendor_version":"1.3"}
csi-node-driver-registrar I0621 21:31:10.481063       1 connection.go:252] GRPC error: <nil>
hpe-csi-node-init + '[' true = true ']'
hpe-csi-node-init + echo 'Node configuration is disabled'
hpe-csi-node-init + disableConformanceCheck=true
hpe-csi-node-init + '[' true '!=' true ']'
hpe-csi-node-init + exec /bin/csi-driver --node-init --endpoint=unix:///csi/csi.sock --flavor=kubernetes
hpe-csi-node-init Node configuration is disabled
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="Initialized logging." alsoLogToStderr=true logFileLocation=/var/log/hpe-csi-controller.log logLevel=info
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="**********************************************" file="csi-driver.go:56"
hpe-csi-driver + '[' --flavor=kubernetes = --node-service ']'
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="*************** HPE CSI DRIVER ***************" file="csi-driver.go:57"
hpe-csi-driver + '[' --flavor=kubernetes = --node-init ']'
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="**********************************************" file="csi-driver.go:58"
hpe-csi-driver + for arg in "$@"
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg=">>>>> CMDLINE Exec, args: ]" file="csi-driver.go:60"
hpe-csi-driver + '[' --node-monitor = --node-service ']'
hpe-csi-node-init W0621 21:31:06.910459       1 reflector.go:424] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:145: failed to list *v1.VolumeSnapshot: volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:hpe-storage:hpe-csi-node-sa" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope
hpe-csi-driver + '[' --node-monitor = --node-init ']'
hpe-csi-driver + for arg in "$@"
hpe-csi-driver + '[' --node-monitor-interval=30 = --node-service ']'
hpe-csi-driver + '[' --node-monitor-interval=30 = --node-init ']'
hpe-csi-driver + disableNodeConformance=
hpe-csi-driver + disableNodeConfiguration=true
hpe-csi-driver + '[' '' = true ']'
hpe-csi-driver + '[' true = true ']'
hpe-csi-driver + echo 'copying hpe log collector diag script'
hpe-csi-driver copying hpe log collector diag script
hpe-csi-driver + cp -f /opt/hpe-storage/bin/hpe-logcollector.sh /usr/local/bin/hpe-logcollector.sh
hpe-csi-driver + chmod +x /usr/local/bin/hpe-logcollector.sh
hpe-csi-driver + '[' '!' -f /host/etc/multipath.conf ']'
hpe-csi-driver + '[' true '!=' true ']'
hpe-csi-driver + ln -s /host/etc/multipath.conf /etc/multipath.conf
hpe-csi-driver + ln -s /host/etc/multipath /etc/multipath
hpe-csi-driver + ln -s /host/etc/iscsi /etc/iscsi
hpe-csi-driver + '[' -f /host/etc/redhat-release ']'
hpe-csi-driver + '[' -f /host/etc/os-release ']'
hpe-csi-driver + rm /etc/os-release
csi-node-driver-registrar I0621 21:31:10.481070       1 main.go:173] CSI driver name: "csi.hpe.com"
csi-node-driver-registrar I0621 21:31:10.481110       1 node_register.go:55] Starting Registration Server at: /registration/csi.hpe.com-reg.sock
csi-node-driver-registrar I0621 21:31:10.481557       1 node_register.go:64] Registration Server started at: /registration/csi.hpe.com-reg.sock
csi-node-driver-registrar I0621 21:31:10.481696       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
csi-node-driver-registrar I0621 21:31:11.759238       1 main.go:90] Received GetInfo call: &InfoRequest{}
csi-node-driver-registrar I0621 21:31:11.777891       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = Failed to get initiators for host,}
csi-node-driver-registrar E0621 21:31:11.778006       1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = Failed to get initiators for host, restarting registration container.
hpe-csi-node-init E0621 21:31:06.910696       1 reflector.go:140] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:145: Failed to watch *v1.VolumeSnapshot: failed to list *v1.VolumeSnapshot: volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:hpe-storage:hpe-csi-node-sa" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope
hpe-csi-node-init E0621 21:31:06.910770       1 reflector.go:140] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:127: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=error msg="process with pid : 11 finished with error = exit status 127" file="cmd.go:63"
hpe-csi-node-init time="2024-06-21T21:31:06Z" level=error msg="Error while getting the multipath devices on the node " file="utils.go:11"
hpe-csi-driver + ln -s /host/etc/os-release /etc/os-release
hpe-csi-driver + echo 'starting csi plugin...'
hpe-csi-driver + exec /bin/csi-driver --endpoint=unix:///csi/csi.sock --node-service --flavor=kubernetes --node-monitor --node-monitor-interval=30
hpe-csi-driver starting csi plugin...
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Initialized logging." alsoLogToStderr=true logFileLocation=/var/log/hpe-csi-node.log logLevel=info
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="**********************************************" file="csi-driver.go:56"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="*************** HPE CSI DRIVER ***************" file="csi-driver.go:57"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="**********************************************" file="csi-driver.go:58"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg=">>>>> CMDLINE Exec, args: ]" file="csi-driver.go:60"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Node configuration is disabled, DISABLE_NODE_CONFIGURATION=true.Skipping the Multipath and ISCSI configurations" file="csi-driver.go:142"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="NODE MONITOR: &{flavor:0xc0001f4d10 intervalSec:30 lock:{state:0 sema:0} started:false stopChannel:<nil> done:<nil> nodeName:talos-nvj-4af}" file="nodemonitor.go:26"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CREATE_DELETE_VOLUME" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: LIST_VOLUMES" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CREATE_DELETE_SNAPSHOT" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: LIST_SNAPSHOTS" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CLONE_VOLUME" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: PUBLISH_READONLY" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: EXPAND_VOLUME" file="driver.go:250"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node service capability: STAGE_UNSTAGE_VOLUME" file="driver.go:267"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node service capability: EXPAND_VOLUME" file="driver.go:267"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node service capability: GET_VOLUME_STATS" file="driver.go:267"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume expansion type: ONLINE" file="driver.go:281"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: SINGLE_NODE_WRITER" file="driver.go:293"
Stream closed EOF for hpe-storage/hpe-csi-node-pn769 (csi-node-driver-registrar)
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: SINGLE_NODE_READER_ONLY" file="driver.go:293"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_READER_ONLY" file="driver.go:293"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_SINGLE_WRITER" file="driver.go:293"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_MULTI_WRITER" file="driver.go:293"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="DB service disabled!!!" file="driver.go:145"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="About to start the CSI driver 'csi.hpe.com with KubeletRootDirectory /var/lib/kubelet/'" file="csi-driver.go:186"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="[1] reply  : [/bin/csi-driver --endpoint=unix:///csi/csi.sock --node-service --flavor=kubernetes --node-monitor --node-monitor-interval=30]" file="csi-driver.go:189"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Listening for connections on address: &net.UnixAddr{Name:\"//csi/csi.sock\", Net:\"unix\"}" file="server.go:86"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Scheduled ephemeral inline volumes scrubber task to run every 3600 seconds, PodsDirPath: [/var/lib/kubelet/pods]" file="driver.go:214"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg=">>>>> Scrubber task invoked at 2024-06-21 21:31:07.639939957 +0000 UTC m=+0.038113292" file="driver.go:746"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="No ephemeral inline volumes found" file="driver.go:815"
hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="<<<<< Scrubber task completed at 2024-06-21 21:31:07.644313576 +0000 UTC m=+0.042486921" file="driver.go:751"
hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC call: /csi.v1.Identity/GetPluginInfo" file="utils.go:69"
hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC request: {}" file="utils.go:70"
hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg=">>>>> GetPluginInfo" file="identity_server.go:16"
hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="<<<<< GetPluginInfo" file="identity_server.go:19"
hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC response: {\"name\":\"csi.hpe.com\",\"vendor_version\":\"1.3\"}" file="utils.go:75"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetInfo" file="utils.go:69"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="GRPC request: {}" file="utils.go:70"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="Writing uuid to file:/etc/hpe-storage/node.gob uuid:fb4b2815-d7b3-8e09-bf95-39eb01fb29ed" file="chapidriver_linux.go:52"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="process with pid : 20 finished with error = exit status 127" file="cmd.go:63"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="Host name reported as talos-nvj-4af" file="node_server.go:2087"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=warning msg="no fc adapters found on the host" file="fc.go:49"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="Failed to get initiators for host talos-nvj-4af.  Error: iscsi and fc initiators not found" file="node_server.go:2091"
hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="GRPC error: rpc error: code = Internal desc = Failed to get initiators for host" file="utils.go:73"
hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC call: /csi.v1.Identity/GetPluginInfo" file="utils.go:69"
hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC request: {}" file="utils.go:70"
hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg=">>>>> GetPluginInfo" file="identity_server.go:16"
hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="<<<<< GetPluginInfo" file="identity_server.go:19"
hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC response: {\"name\":\"csi.hpe.com\",\"vendor_version\":\"1.3\"}" file="utils.go:75"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetInfo" file="utils.go:69"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="GRPC request: {}" file="utils.go:70"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="Host name reported as talos-nvj-4af" file="node_server.go:2087"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=warning msg="no fc adapters found on the host" file="fc.go:49"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="process with pid : 23 finished with error = exit status 127" file="cmd.go:63"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="Failed to get initiators for host talos-nvj-4af.  Error: iscsi and fc initiators not found" file="node_server.go:2091"
hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="GRPC error: rpc error: code = Internal desc = Failed to get initiators for host" file="utils.go:73"
Stream closed EOF for hpe-storage/hpe-csi-node-pn769 (hpe-csi-node-init)

@evilhamsterman
Author

Not sure how much help it is, but looking at your code it looks like perhaps the main issue is you're looking for the /etc/iscsi/initiatorname.iscsi file, but that file doesn't exist in the normal place on their system. Their extension bind mounts /usr/local/etc/iscsi/iscsid.conf into the extension container at /etc/iscsi/iscsid.conf https://github.com/siderolabs/extensions/blob/f0b6082466dc78a309d1e9a7d8525497d714d4d4/storage/iscsi-tools/iscsid.yaml#L52C5-L53C42 but it doesn't mount the rest of the iSCSI folder, so the initiator name is not accessible to you.

Looks to me like they need to mount the full /usr/local/etc/iscsi directory so that your driver can access that file; I assume that's how you get the initiator to register with the storage.

@evilhamsterman
Author

EUREKA! I found it: they do mount the /etc/iscsi directory to /system/iscsi on the host. I shelled into the hpe-csi-node pod's hpe-csi-driver container and changed the link from /etc/iscsi -> /host/etc/iscsi to /etc/iscsi -> /host/system/iscsi, and when the registrar next restarted, the driver container was able to find the initiator name and everything is now running.
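Roughly, the manual change was along these lines (a one-off hack, not a fix; the pod name will differ):

kubectl -n hpe-storage exec -it hpe-csi-node-5t69x -c hpe-csi-driver -- sh
# inside the container: repoint the symlink created by the entrypoint
ln -sfn /host/system/iscsi /etc/iscsi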

❯ k get pods
NAME                                  READY   STATUS    RESTARTS        AGE
hpe-csi-controller-8447c48d9f-rjd49   9/9     Running   0               22m
hpe-csi-node-5t69x                    2/2     Running   9 (5m45s ago)   22m
nimble-csp-74776998b6-fmcn2           1/1     Running   0               22m
primera3par-csp-58dd48cccb-lvvjb      1/1     Running   0               22m

Obviously that will break when that pod restarts. But I then created a StorageClass and a PVC and it worked right away

❯ k get pvc
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
my-first-pvc   Bound    pvc-dc881628-ffe4-42c9-951e-e266502dd226   32Gi       RWO            csq-it-nimble1   <unset>                 2m5s

and I can see the volume on the array.

The last step, mounting it, is the one remaining issue. It does not successfully mount the volume; that appears to be related to the multipath issue I mentioned. I'm signing off for the weekend; I'll look more on Monday.

@datamattsson
Collaborator

I should've researched this but is /usr/local/etc writable in Talos? (or, what directory IS writable and/or persistent on Talos?) I'm thinking we could just add a Helm chart parameter for CSI driver users to relocate /etc to whatever directory on the node.

As for commands the CSI driver needs to have available, look for clues here: https://github.com/hpe-storage/csi-driver/blob/master/Dockerfile

As for the multipath issue: the HPE CSI Driver requires multipath/multipathd on the host. There's no workaround, as we don't even consider non-multipath entries.

I'm out of pocket for the rest of the weekend as well, cheers!

@evilhamsterman
Author

I've exhausted the time I can work on this for now, but this is what I found messing around some more. Hopefully it can help you get on the correct path, though it certainly looks like it's going to require more work than just changing the mount location. It does give me a little better idea for my planning, though; I'll probably need to plan on a longer timeline for support.

It looks like /system is supposed to be the location for "persistent" data, but it appears they mean persistent for the extension container lifecycle: the data survives a restart of the extension but not a reboot. The path /system/state, which contains the node config, is persistent, and /var, which is used as the storage for container images, is persistent across reboots but I believe is not guaranteed.

However, because the extensions are not persistent across reboots, things like the initiator name are not consistent; a new one is generated on every boot. Because of this I don't think it's a good idea to try to persist your node ID on disk like we discussed earlier. Either it should be generated dynamically, or you should use the Kubernetes node ID and store extra persistent data in a ConfigMap or CRD. In my opinion this is more in line with the general idea of Kubernetes anyway, and with cattle-vs-pets workflows.

Overall I see two, maybe three, major problems. One will require changes from Talos; the others will require work on your driver.

  1. Their iscsi-tools extension doesn't include multipath support, as I mentioned above. I've commented on their ticket; hopefully we can get some attention from them.
  2. Because iscsi-tools runs as an OS-level container, it also has a very limited subset of tools. I was able to get iscsiadm to work by changing the chroot script to use nsenter instead (script below), though maybe it would work without using env.
#!/bin/bash
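# Instead of chrooting into the read-only host, find the iscsid daemon that
# the iscsi-tools extension runs and execute iscsiadm inside that process's
# mount and network namespaces.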

iscsi_pid=$(pgrep -f "iscsid -f")

nsenter --mount="/proc/$iscsi_pid/ns/mnt" --net="/proc/$iscsi_pid/ns/net" -- /usr/local/sbin/iscsiadm "${@:1}"
  3. You may have to use your own binaries for some operations and/or limit FS support to just XFS with Talos. The host system has binaries for xfs and vfat, but no mount or ext4/btrfs binaries. Here's a dump of all the binaries available on the host; note the iscsi ones come from the iscsi-tools extension:
/ # find /host -name "*bin" -type d 2>/dev/null | grep -v var | grep -v container | xargs ls
/host/bin:
containerd               containerd-shim-runc-v2
containerd-shim          runc

/host/opt/cni/bin:
bandwidth    firewall     ipvlan       ptp          tuning
bridge       flannel      loopback     sbr          vlan
dhcp         host-device  macvlan      static       vrf
dummy        host-local   portmap      tap

/host/sbin:
blkdeactivate             lvm                       udevadm
dashboard                 lvm_import_vdo            udevd
dmsetup                   lvmconfig                 vgcfgbackup
dmstats                   lvmdevices                vgcfgrestore
dmstats.static            lvmdiskscan               vgchange
fsadm                     lvmdump                   vgck
fsck.xfs                  lvmsadc                   vgconvert
init                      lvmsar                    vgcreate
ip6tables                 lvreduce                  vgdisplay
ip6tables-apply           lvremove                  vgexport
ip6tables-legacy          lvrename                  vgextend
ip6tables-legacy-restore  lvresize                  vgimport
ip6tables-legacy-save     lvs                       vgimportclone
ip6tables-restore         lvscan                    vgimportdevices
ip6tables-save            mkfs.xfs                  vgmerge
iptables                  modprobe                  vgmknodes
iptables-apply            poweroff                  vgreduce
iptables-legacy           pvchange                  vgremove
iptables-legacy-restore   pvck                      vgrename
iptables-legacy-save      pvcreate                  vgs
iptables-restore          pvdisplay                 vgscan
iptables-save             pvmove                    vgsplit
lvchange                  pvremove                  wrapperd
lvconvert                 pvresize                  xfs_repair
lvcreate                  pvs                       xtables-legacy-multi
lvdisplay                 pvscan
lvextend                  shutdown

/host/usr/bin:
udevadm

/host/usr/local/bin:

/host/usr/local/sbin:
brcm_iscsiuio            iscsi_offload            iscsiuio
iscsi-gen-initiatorname  iscsiadm                 tgtadm
iscsi-iname              iscsid                   tgtd
iscsi_discovery          iscsid-wrapper           tgtimg
iscsi_fw_login           iscsistart

/host/usr/sbin:
cryptsetup      mkfs.fat        xfs_freeze      xfs_ncheck
dosfsck         mkfs.msdos      xfs_fsr         xfs_quota
dosfslabel      mkfs.vfat       xfs_growfs      xfs_rtcp
fatlabel        veritysetup     xfs_info        xfs_scrub
fsck.fat        xfs_admin       xfs_io          xfs_scrub_all
fsck.msdos      xfs_bmap        xfs_logprint    xfs_spaceman
fsck.vfat       xfs_copy        xfs_mdrestore
integritysetup  xfs_db          xfs_metadump
mkdosfs         xfs_estimate    xfs_mkfile

@datamattsson
Collaborator

Thanks for the additional context. This definitely needs more work. I'm just puzzled that we can't even persist an IQN on the host, though. Do we need to grab the first-boot one, store it in our CRD, and regenerate the host IQN from that?

I guess FC wouldn't have as many problems, but we would still need multipath/multipathd regardless. Not having ext4 available will also create problems for our NFS server implementation for RWX claims, which doesn't play nicely with XFS in failure scenarios.

@evilhamsterman
Author

> Thanks for the additional context. This definitely needs more work. I'm just puzzled that we can't even persist an IQN on the host, though. Do we need to grab the first-boot one, store it in our CRD, and regenerate the host IQN from that?

It doesn't look like you can manage the IQN; their service generates one itself.

Just my thoughts; I can think of two ways to deal with it:

  1. Don't care about it. Have the controller only add an IQN to the array when needed and remove it when not needed. For example: you're running a DB on a node with a PV, then you drain the node, so the DB gets rescheduled on a different node; the old node is no longer needed, so the controller removes it from the array. The controller would also keep track of known initiators and occasionally check whether they are still live in the Kubernetes cluster, removing them from the array if not, to catch cases where nodes disappear. This would be the cattle/pets option.
  2. Use the new ExtensionServiceConfig to specify IQNs, https://www.talos.dev/v1.7/reference/configuration/extensions/extensionserviceconfig/ (rough sketch below). This would require that Talos add support for it, and administrators would have to generate IQNs for their systems, which could be error prone.
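For what it's worth, a sketch of what that machine config document might look like, assuming the iscsi-tools extension (service name assumed to be iscsid here) were changed to honor an initiator name supplied this way; it does not today, and the IQN would have to be generated per node by the administrator:

apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: iscsid                      # assumed extension service name
configFiles:
  - content: InitiatorName=iqn.2005-03.org.open-iscsi:worker-1
    mountPath: /etc/iscsi/initiatorname.iscsi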

> I guess FC wouldn't have as many problems, but we would still need multipath/multipathd regardless. Not having ext4 available will also create problems for our NFS server implementation for RWX claims, which doesn't play nicely with XFS in failure scenarios.

Looking around at other CSI iSCSI implementations, it looks like many of them use their own mkfs and mount binaries rather than relying on the host.
