Deploy TPU Webhook to ray-system Namespace and Remove RayCluster Label Selectors #354

Merged (8 commits) on Mar 15, 2024

@ryanaoleary ryanaoleary commented Mar 14, 2024

This PR deploys the webhook, service, and certificate to the ray-system namespace instead of default. The ray-system namespace is created with kubectl create namespace if it does not already exist. This PR also splits the /validate operations the webhook performs into two configs: validate-pod- and validate-raycluster-. validate-pod- continues to select objects based on the automatically applied app.kubernetes.io/name: kuberay label, while validate-raycluster- now intercepts every RayCluster creation request, regardless of labels, to check whether it requests TPUs.
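
As a rough illustration of the split described above, the two configs might look something like the following. This is a sketch under stated assumptions, not the manifests from this PR: the `-example` name suffixes, the webhook `name` fields, the `operations` lists, and the `ray.io` API version are placeholders, and CA injection (for example via cert-manager) is omitted.

```yaml
# Sketch only: names marked "hypothetical" are not taken from this PR.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validate-pod-example            # hypothetical suffix; the PR only gives "validate-pod-"
webhooks:
  - name: validate-pod.example.com      # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: kuberay-tpu-webhook
        namespace: ray-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]          # assumed
        resources: ["pods"]
    # Only Pods carrying the label that KubeRay applies automatically are selected.
    objectSelector:
      matchLabels:
        app.kubernetes.io/name: kuberay
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validate-raycluster-example     # hypothetical suffix
webhooks:
  - name: validate-raycluster.example.com   # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: kuberay-tpu-webhook
        namespace: ray-system
        path: /validate
    rules:
      - apiGroups: ["ray.io"]
        apiVersions: ["v1"]             # assumed
        operations: ["CREATE"]
        resources: ["rayclusters"]
    # No objectSelector: every RayCluster creation request is intercepted.
```

Dropping the `objectSelector` from the RayCluster config is what lets the webhook see RayClusters that do not carry any particular label.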

In a follow-up PR, I plan to create a Helm chart for the webhook and have it installed in the same namespace as kuberay-operator in modules/kuberay-operator/kuberay.tf. Users will have the option of deploying the webhook manually using the Makefile or creating it through the Terraform in applications/ray when enable_tpu = true.

Related Issues:
#306
#307
#308

Testing Strategy:

  • Manual Tests
    • This PR was manually tested by deploying the webhook with the updated configuration, verifying that it successfully runs in ray-system, deploying a multi-host 2x2x2 RayCluster with 2 replicas (causing 4 pods to scale up), and checking that the webhook correctly injected all pod affinities, labels, and env vars using kubectl describe.

andrewsykim commented Mar 14, 2024

@ryanaoleary can we follow some of the guidance around webhooks mentioned here: https://cloud.google.com/kubernetes-engine/docs/how-to/optimize-webhooks

Specifically this section about excluding kube-system and kube-node-lease namespaces.

@ryanaoleary

Done, I changed it to exclude those two namespaces.
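
For reference, one common way to express this exclusion is a `namespaceSelector` on each webhook entry. The fragment below is a sketch of that pattern and not necessarily the exact change made in this PR. It relies on the `kubernetes.io/metadata.name` label that Kubernetes sets automatically on every namespace.

```yaml
# Fragment of a single entry under `webhooks:`; other required fields are omitted.
namespaceSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
        - kube-system
        - kube-node-lease
```

With this selector, requests for objects in kube-system and kube-node-lease are never sent to the webhook, so an unavailable webhook cannot block system components in those namespaces.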

@ryanaoleary

/gcbrun

@andrewsykim changed the title from "Deploy Webhook to Ray-System Namespace and Remove RayCluster Label Selectors" to "Deploy TPU Webhook to ray-system Namespace and Remove RayCluster Label Selectors" on Mar 15, 2024

@andrewsykim andrewsykim left a comment

LGTM

@ryanaoleary since we don't have automated tests, can you document manual steps you took to test this change?

@ryanaoleary

Done.

@ryanaoleary ryanaoleary merged commit 010c2ae into main Mar 15, 2024
7 checks passed
    - kuberay-tpu-webhook.default.svc
    - kuberay-tpu-webhook.default.svc.cluster.local
    - kuberay-tpu-webhook.ray-system.svc
    - kuberay-tpu-webhook.ray-system.svc.cluster.local
  issuerRef:
    name: selfsigned-issuer
  secretTemplate:

(the reflection annotations below can be removed now)

@ryanaoleary ryanaoleary deleted the webhook-update branch April 1, 2024 18:32