Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change failurePolicy to Fail for Katib Webhooks #2018

Conversation

andreyvelich
Copy link
Member

Fixes: #1946.
I removed failurePolicy: Ignore from our Admission Webhooks, so the default policy will be used (which is Fail).

/assign @kubeflow/wg-automl-leads @tenzen-y @ca-scribner

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
Copy link

Coverage Status

Coverage remained the same at 73.533% when pulling 556027c on andreyvelich:issue-1946-webhook-failure-policy into a5ef2db on kubeflow:master.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Thanks!

/lgtm

@tenzen-y
Copy link
Member

Maybe, we need to fix E2E...

Name:             katib-controller-5474c9d7dc-hlj74
Namespace:        kubeflow
Priority:         0
Service Account:  katib-controller
Node:             fv-az202-219/10.1.0.54
Start Time:       Thu, 17 Nov 2022 18:44:07 +0000
Labels:           katib.kubeflow.org/component=controller
                  pod-template-hash=5474c9d7dc
Annotations:      prometheus.io/port: 8080
                  prometheus.io/scrape: true
                  sidecar.istio.io/inject: false
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/katib-controller-5474c9d7dc
Containers:
  katib-controller:
    Container ID:  
    Image:         docker.io/kubeflowkatib/katib-controller:e2e-test
    Image ID:      
    Ports:         8443/TCP, 8080/TCP, 18080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      ./katib-controller
    Args:
      --webhook-port=8443
      --trial-resources=Job.v1.batch
      --trial-resources=TFJob.v1.kubeflow.org
      --trial-resources=PyTorchJob.v1.kubeflow.org
      --trial-resources=MPIJob.v1.kubeflow.org
      --trial-resources=XGBoostJob.v1.kubeflow.org
      --trial-resources=MXJob.v1.kubeflow.org
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KATIB_CORE_NAMESPACE:  kubeflow (v1:metadata.namespace)
    Mounts:
      /tmp/cert from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqd8c (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  katib-webhook-cert
    Optional:    false
  kube-api-access-vqd8c:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Normal   Scheduled    2m1s              default-scheduler  Successfully assigned kubeflow/katib-controller-5474c9d7dc-hlj74 to fv-az202-219
  Warning  FailedMount  57s (x8 over 2m)  kubelet            MountVolume.SetUp failed for volume "cert" : secret "katib-webhook-cert" not found

https://github.com/kubeflow/katib/actions/runs/3490817753/jobs/5842728114#step:4:10217

@tenzen-y
Copy link
Member

/lgtm cancel
/hold

@andreyvelich
Copy link
Member Author

andreyvelich commented Nov 28, 2022

@tenzen-y @johnugeorge I know why e2e are failing for this change.
After this PR our Webhooks don't ignore any errors and Katib Cert Generator job can't be deployed.

When Katib Control Plane is deployed, Webhooks don't have appropriate certs and the Cert Generator Job's Pod creation failed with the following error:

Warning  FailedCreate  2m8s (x5 over 4m18s)  job-controller  Error creating: Internal error occurred: failed calling webhook "mutator.pod.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

Any ideas how we can avoid it ?
Should we add another ObjectSelector to our Pod mutation webhook which allows us (and users) to disable Webhook for the Cert Manager Job ? For example:

  namespaceSelector:
    matchLabels:
      katib.kubeflow.org/metrics-collector-injection: enabled
  objectSelector:
    matchExpressions:
    - key: katib.kubeflow.org/metrics-collector-injection
      operator: NotIn
      values:
      - disabled

Similar to Istio Sidecar injector.

@tenzen-y
Copy link
Member

@andreyvelich
In the short term, as you say, we should add the object selector to the MutatingWebhookConfiguration.

In the long term, I would like to perform processes of katib-cert-generator in katib-controller as a sub-process, then we can remove the object selector from the MutatingWebhookConfiguration.

This mean, we integrate the katib-cert-generator into the katib-controller, then we will remove the katib-cert-generator.

@andreyvelich
Copy link
Member Author

andreyvelich commented Nov 28, 2022

In the long term, I would like to perform processes of katib-cert-generator in katib-controller as a sub-process, then we can remove the object selector from the MutatingWebhookConfiguration.

@tenzen-y For example, using init container ?
But Katib Controller is dependent on Katib Webhook Cert which is created as part of Cert Generator Job. We also use this secret in Cert Manager installation.
Do you have any suggestions how we can use the same Pod to generate certs ?

@johnugeorge
Copy link
Member

@tenzen-y @johnugeorge I know why e2e are failing for this change. After this PR our Webhooks don't ignore any errors and Katib Cert Generator job can't be deployed.

When Katib Control Plane is deployed, Webhooks don't have appropriate certs and the Cert Generator Job's Pod creation failed with the following error:

Warning  FailedCreate  2m8s (x5 over 4m18s)  job-controller  Error creating: Internal error occurred: failed calling webhook "mutator.pod.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

Any ideas how we can avoid it ? Should we add another ObjectSelector to our Pod mutation webhook which allows us (and users) to disable Webhook for the Cert Manager Job ? For example:

  namespaceSelector:
    matchLabels:
      katib.kubeflow.org/metrics-collector-injection: enabled
  objectSelector:
    matchExpressions:
    - key: katib.kubeflow.org/metrics-collector-injection
      operator: NotIn
      values:
      - disabled

Similar to Istio Sidecar injector.

How is this selector specific to cert-generator?

@tenzen-y
Copy link
Member

tenzen-y commented Nov 28, 2022

For example, using init container ?

No, I think that we can create and inject cert for webhook to the katib-webhook-cert secret in the main container using the go channel like opa/gatekeeper.

https://github.com/open-policy-agent/gatekeeper/blob/1ae07b704ed7039c3da75ceec79d8cb49b962008/main.go#L195-L223

@andreyvelich
Copy link
Member Author

How is this selector specific to cert-generator?

@johnugeorge We can update our Cert Generator Job YAML with the following label, so mutation webhook won't be invoked:

  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        katib.kubeflow.org/metrics-collector-injection: disabled

@andreyvelich
Copy link
Member Author

For example, using init container ?

No, I think that we can create and inject cert for webhook to the katib-webhook-cert secret in the main container using the go channel like opa/gatekeeper.

https://github.com/open-policy-agent/gatekeeper/blob/1ae07b704ed7039c3da75ceec79d8cb49b962008/main.go#L195-L223

Make sense, so we can do it before we Register Controller Components.

@johnugeorge
Copy link
Member

The confusing part is the metrics-collector-injection label. As this doesn't have any connection with metrics collector, we can have something different.
I like @tenzen-y 's suggestion for long term

@andreyvelich
Copy link
Member Author

The confusing part is the metrics-collector-injection label. As this doesn't have any connection with metrics collector, we can have something different.

@johnugeorge Would it be useful to have this feature for our users who don't want to get Katib Mutation Webhook executed on their Pods ? Currently, if namespace has katib.kubeflow.org/metrics-collector-injection: enabled label, Webhook will be always executed, and we verify if Mutation if required.

@tenzen-y
Copy link
Member

↑ ping @johnugeorge

@johnugeorge
Copy link
Member

Yes. We can add a specific label to skip mutation web hook.

@andreyvelich
Copy link
Member Author

Let's discuss this again on December 14th (next community meeting) @johnugeorge @tenzen-y .
I want to review it again and discuss for which users this label will be useful.

If we are going to integrate cert-manager to Katib Controller startup as @tenzen-y mentioned, we can avoid this confusing label: katib.kubeflow.org/metrics-collector-injection: disabled

@andreyvelich
Copy link
Member Author

andreyvelich commented Dec 14, 2022

Let's merge this PR after next release, so we can discuss should we create label to disable Webhook.

@tenzen-y
Copy link
Member

tenzen-y commented Aug 1, 2023

I created #2185 to consolidate the katib-cert-generator to the katib-controller.
Once #2185 is merged, we can move this forward.

@andreyvelich andreyvelich force-pushed the issue-1946-webhook-failure-policy branch from f4e7bad to f6f63e3 Compare August 4, 2023 19:37
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andreyvelich !
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 4, 2023
@andreyvelich
Copy link
Member Author

Tests are complete 🎉
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 114485d into kubeflow:master Aug 4, 2023
58 checks passed
@andreyvelich andreyvelich deleted the issue-1946-webhook-failure-policy branch August 4, 2023 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should MutatingWebhook failurePolicy be Ignore?
4 participants