TPU Provisioner: Fix skipped scale ups & garbage collect zombie node pools #277

@nstogner (Collaborator) commented Mar 4, 2024

  • Avoid scale-ups when the node pool is in a deletion state.
  • Avoid scale-ups when the Pod being reconciled is in a deletion state.
  • Use the owner UID for node pool names to avoid scale-up failures when the scheduler has placed another workload on the same existing node pool.
  • Run a loop that deletes node pools that have no Nodes, are in an errored state, and whose creating Pod no longer exists (the deletion reconciler would not see these because there are no Node objects).
  • Update events to be more descriptive for easier debugging.
  • Make Secure Boot configurable (the default keeps the existing behavior).
  • Make the deletion delay configurable so that the make test target passes consistently.

@nstogner changed the title from "TPU Provisioner: Fix skipped scale ups" to "TPU Provisioner: Fix skipped scale ups & garbage collect zombie node pools" on Mar 8, 2024
@nstogner force-pushed the tpu-provisioner--fix-scale-up-bug branch from 1e51be3 to 1039c27 on March 12, 2024 01:26
@danielvegamyhre (Contributor) left a comment

LGTM overall, left a few minor comments and questions

@nstogner (Collaborator, Author) commented

Updated to address the comments. Also made the deletion delay configurable so that the make test target passes consistently.

@nstogner force-pushed the tpu-provisioner--fix-scale-up-bug branch from b9fdef5 to 72c34c6 on March 19, 2024 19:47
@nstogner (Collaborator, Author) commented

Rebased @danielvegamyhre

@danielvegamyhre (Contributor) commented

/gcbrun

@danielvegamyhre merged commit 4bbabed into GoogleCloudPlatform:main on Mar 19, 2024
4 checks passed