Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix certgen job race condition #5082

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ryanhristovski
Copy link
Contributor

@ryanhristovski ryanhristovski commented Jan 16, 2025

Envoy-gateway certgen job frequently goes into a crash loop when running a sync using ArgoCD in our environment with the following error:

envoy-gateway-gateway-helm-certgen-pxnd7 envoy-gateway-certgen failed to output certificates: failed to create or update secrets: failed to get secret infra-routing/envoy-gateway: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Unauthorized

Though, sometimes this job succeeds with no problems...

I believe this inconsistency is due to a race condition between the RBAC resources and the certgen job (both installed with pre-install). Since the RBAC resources do not exist continuously, and are only applied on installation and then removed, I believe the job Is trying to start before the permissions are available. After setting the hook-weight annotation in our environment we no longer see this problem.

Another solution could be to have the RBAC resources exist continuously without the pre-hook but I imagine this was a conscious security decision.

@ryanhristovski ryanhristovski requested a review from a team as a code owner January 16, 2025 16:13
@ryanhristovski
Copy link
Contributor Author

/retest

Copy link

codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.84%. Comparing base (27a2ef3) to head (23be57b).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5082      +/-   ##
==========================================
- Coverage   66.85%   66.84%   -0.01%     
==========================================
  Files         211      211              
  Lines       32916    32920       +4     
==========================================
  Hits        22007    22007              
- Misses       9584     9587       +3     
- Partials     1325     1326       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@shahar-h
Copy link
Contributor

The hook-weight should not make any difference since RBAC resources are installed by helm before Job. See: https://helm.sh/docs/intro/using_helm/#helm-install-installing-a-package.

@ryanhristovski
Copy link
Contributor Author

ryanhristovski commented Jan 16, 2025

@shahar-h ArgoCD doesn't use helm install to apply its resources from my understanding, it applies yaml rendered with helm template - which I believe can cause this race condition.

@shahar-h
Copy link
Contributor

@shahar-h ArgoCD doesn't use helm install to apply its resources from my understanding, it applies yaml rendered with helm template - which I believe can cause this race condition.

According to argocd install order Job is installed after RBAC resources.

@ryanhristovski
Copy link
Contributor Author

ryanhristovski commented Jan 16, 2025

@shahar-h I was able to replicate my error with the following code, using a server-side apply... but its very inconsistent (it happens about 1/20 times, trying to understand why, and to see if the hook-weight even fixed this) 😅

---
apiVersion: v1
kind: Namespace
metadata:
  name: test-namespace
  labels:
    name: test-namespace
---
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
  namespace:test-namespace
  annotations:
    "helm.sh/hook": pre-install, pre-upgrade
spec:
  template:
    spec:
      serviceAccountName: test-sa
      containers:
        - name: test-container
          # Using 'bitnami/kubectl' so we can run `kubectl get secrets`.
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "Attempting to read secrets..."
              kubectl get secrets
              echo "If you see an RBAC or SA error above, the job ran before RBAC was ready."
      restartPolicy: Never
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-sa
  namespace: test-namespace
  annotations:
    "helm.sh/hook": pre-install
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: test-role
  namespace: test-namespace
  annotations:
    "helm.sh/hook": pre-install
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: test-rolebinding
  namespace: test-namespace
  annotations:
    "helm.sh/hook": pre-install
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: test-role
subjects:
  - kind: ServiceAccount
    name: test-sa
    namespace: infra-routing

Setting to draft while I confirm this.

@ryanhristovski ryanhristovski marked this pull request as draft January 16, 2025 23:21
@ryanhristovski ryanhristovski marked this pull request as ready for review January 16, 2025 23:23
@ryanhristovski ryanhristovski marked this pull request as draft January 16, 2025 23:44
Signed-off-by: Ryan Hristovski <[email protected]>
@ryanhristovski ryanhristovski changed the title set certgen job & rbac helm hook-weights fix certgen job race condition Jan 17, 2025
Signed-off-by: Ryan Hristovski <[email protected]>
@ryanhristovski ryanhristovski marked this pull request as ready for review January 17, 2025 00:51
@ryanhristovski
Copy link
Contributor Author

ryanhristovski commented Jan 17, 2025

@shahar-h I can't understand why this is failing, my only theory is that the order is not respected when it comes to hooks... regardless of being able to replicate the error, the more I thought about it, I don't think having a pre-install hook for RBAC makes sense.

The object persists throughout and they are just recreated each time a sync is triggered with nothing new added, I don't see any benefit in them being recreated each time theres a sync. I updated this PR to remove the pre-install annotations on RBAC, let me know what you think.

Edit: will fix tests if I get positive consensus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants