Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Ray TPU Webhook Reliability Improvements (GoogleCloudPlatform#723)
* Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Make webhook stateless in between mutate calls Signed-off-by: Ryan O'Leary <[email protected]> * Formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Fix bug causing incorrect IDs * Add cluster role and log formatting changes Signed-off-by: Ryan O'Leary <[email protected]> * Filter pods by Ray worker group label Signed-off-by: Ryan O'Leary <[email protected]> * Vulnerability fixes Signed-off-by: Ryan O'Leary <[email protected]> * Better names and add ServiceAccount Signed-off-by: Ryan O'Leary <[email protected]> * Change version back to v1.1 Signed-off-by: Ryan O'Leary <[email protected]> * Change implementation to use PodInformer Signed-off-by: Ryan O'Leary <[email protected]> * Use PodLister Signed-off-by: Ryan O'Leary <[email protected]> * updateSliceToWorkerIDs returns error Signed-off-by: Ryan O'Leary <[email protected]> * Use mutex lock in updateSliceToWorkerIDs Signed-off-by: Ryan O'Leary <[email protected]> * Update unit tests and fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Remove global client var Signed-off-by: Ryan O'Leary <[email protected]> * Just return err instead of logging Signed-off-by: Ryan O'Leary <[email protected]> * TODO comment Signed-off-by: Ryan O'Leary <[email protected]> * Lock when reading from shared sliceToWorkerIDs mapping Signed-off-by: Ryan O'Leary <[email protected]> * Switch to using RWMutex Signed-off-by: Ryan O'Leary <[email protected]> * Ray TPU Webhook Autoscaling Support (GoogleCloudPlatform#740) * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Should not fatal log in deletePod Signed-off-by: Ryan O'Leary <[email protected]> * deletePod admission always succeeds Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused tests make command Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Just return an error instead of logging Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]> * Generate hostnames at Pod creation Signed-off-by: Ryan O'Leary <[email protected]> * Update tests and add error checking Signed-off-by: Ryan O'Leary <[email protected]> * Close stop channel on webhook termination Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook to avoid using global vars Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change service account name Signed-off-by: Ryan O'Leary <[email protected]> * Return BadRequest if invalid kind Signed-off-by: Ryan O'Leary <[email protected]> * Fix comments Signed-off-by: Ryan O'Leary <[email protected]> * Change error messages Signed-off-by: Ryan O'Leary <[email protected]> * Fatal log in main Signed-off-by: Ryan O'Leary <[email protected]> * Update function comments Signed-off-by: Ryan O'Leary <[email protected]> * Refactor to minimize indentations Signed-off-by: Ryan O'Leary <[email protected]> * Change sliceToWorkerIDs nil check to use len Signed-off-by: Ryan O'Leary <[email protected]> * Write http.Error to header Signed-off-by: Ryan O'Leary <[email protected]> * Don't fatal log in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Check for nil admission request Signed-off-by: Ryan O'Leary <[email protected]> * Add doc comment Signed-off-by: Ryan O'Leary <[email protected]> * Update expected errors Signed-off-by: Ryan O'Leary <[email protected]> * Better getNextWorkerID logic Signed-off-by: Ryan O'Leary <[email protected]> * Update replicaIndex and nextWorkerID tests Signed-off-by: Ryan O'Leary <[email protected]> * Refactor webhook unit tests Signed-off-by: Ryan O'Leary <[email protected]> * Create numOfHosts pods for Pod List Signed-off-by: Ryan O'Leary <[email protected]> * Log admission request object name Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits and go vet output Signed-off-by: Ryan O'Leary <[email protected]> * Initial cloudbuil commit Signed-off-by: Ryan O'Leary <[email protected]> * Fix vet command Signed-off-by: Ryan O'Leary <[email protected]> * Update cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Fix cloudbuild errors Signed-off-by: Ryan O'Leary <[email protected]> * Add dir Signed-off-by: Ryan O'Leary <[email protected]> * Remove arg Signed-off-by: Ryan O'Leary <[email protected]> * Change to bash command Signed-off-by: Ryan O'Leary <[email protected]> * increase timeout time Signed-off-by: Ryan O'Leary <[email protected]> * Fix validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Fix nits for cloudbuild Signed-off-by: Ryan O'Leary <[email protected]> * Break early in validateRayCluster Signed-off-by: Ryan O'Leary <[email protected]> * Remove unnecessary args from validateRayCluster test Signed-off-by: Ryan O'Leary <[email protected]> * Change break to continue Signed-off-by: Ryan O'Leary <[email protected]> * Remove unused vars from webhook tests and add edge cases Signed-off-by: Ryan O'Leary <[email protected]> * Update helm chart Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]>
- Loading branch information