Rollout operator with mimir-distributed helm chart not upgrading Pods #14

Closed
krajorama opened this issue Apr 28, 2022 · 3 comments

@krajorama
Contributor

Reproduction steps:

Install Mimir from grafana/helm-charts#1205 and enable, for example, store-gateway zone-aware replication via a custom values.yaml:

rollout_operator:
  enabled: true
store_gateway:
  zone_aware_replication:
    enabled: true

After installation, add any character to mimir.config, just to change its checksum.

Expected (works without the rollout operator): store-gateway Pods are restarted to pick up the new configuration.

Actual: nothing happens, Pods are not restarted.

Additional info:
The rollout operator does log that it reconciled the store-gateway StatefulSets.

Before the config change, the StatefulSet state is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 2
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2897289"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6795c75577
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6795c75577

After the upgrade:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 3
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2902316"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98

I also added the checksum as an annotation on the StatefulSet itself, but that didn't help.

@krajorama
Contributor Author

With the rollout operator stopped, after another config update:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: eb54c06d95c2e592f6c00fef442070c26c355f3178d03cbaab32c149534b0b3a
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 4
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2905246"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: eb54c06d95c2e592f6c00fef442070c26c355f3178d03cbaab32c149534b0b3a
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98
  observedGeneration: 4
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-764d89475

@krajorama
Contributor Author

So it turns out the issue is a missing "name" label in the StatefulSet template (not the object name, but an actual label), which the operator requires here: https://github.com/grafana/rollout-operator/blob/main/pkg/controller/controller.go#L402
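For illustration, a minimal sketch of where such a label would sit in the StatefulSet template. The `rollout-group` and `zone` labels are the ones from the dumps above; the value given to `name` is only an assumption for this example and is not taken from the chart or the operator:

```yaml
# Illustrative fragment only, not the chart's actual output.
spec:
  template:
    metadata:
      labels:
        rollout-group: store-gateway
        zone: zone-a
        # The "name" label the operator expected at the time of this issue.
        # The value below is hypothetical.
        name: store-gateway-zone-a
```

None of the three dumps above has a `name` key under `spec.template.metadata.labels`, which matches the behaviour seen: the StatefulSets were reconciled, but no Pods were restarted.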

User suggestions and questions:
"

  • The docs / README should probably be updated to mention this as a requirement, beyond just the rollout-group and zone labels.
  • I don't know why this is necessary to begin with. Given the rollout-group and zone labels, why is name needed? I think the purpose of the function is simply to list the pods belonging to a specific zone/rollout-group, so why use "name" rather than those two?
  • About the change you put in: the name is being set to the comp.zonename. I just want to confirm this is sufficient and that it doesn't need to be a pod-specific "name". It is a bit confusing to have a label called "name" on a pod when it is not the name of the pod.

"

@pracucci
Collaborator

I think we can remove the name label requirement. See:
#15
