Remove timeout from RayCluster CR apply; bump CB timeout to mitigate stockouts (#576)

Remove timeout from RayCluster CR apply

RayCluster apply takes O(seconds). The actual ray worker deployment is done asynchronously by the ray operator.
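
Because the apply returns before the Ray workers exist, pod readiness has to be checked in a separate step, which is what the `kubectl wait` lines in `cloudbuild.yaml` do. A minimal sketch of that check as a reusable helper follows. The function name `wait_for_pods_ready` and the `KUBECTL` override variable are hypothetical and do not appear in this repository; the `kubectl wait` flags are the ones used in the diff.

```shell
# Hypothetical override so the helper can be dry-run (e.g. KUBECTL=echo).
KUBECTL="${KUBECTL:-kubectl}"

# Block until every pod in the namespace reports Ready, or the timeout expires.
# $1: namespace, $2: timeout (defaults to the 1200s used in this commit).
wait_for_pods_ready() {
  namespace="$1"
  timeout="${2:-1200s}"
  "$KUBECTL" wait --all pods -n "$namespace" --for=condition=Ready --timeout="$timeout"
}

# Example call, mirroring the Cloud Build step:
#   wait_for_pods_ready "ml-$SHORT_SHA-$_BUILD_ID"
```

Keeping the long timeout on the readiness check rather than on the apply means a node stockout shows up as a failed wait, not as a stalled Terraform run.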
artemvmin authored and kfswain committed Apr 15, 2024
1 parent 1ec6e8e commit 4e640a9
Showing 2 changed files with 3 additions and 6 deletions.
6 changes: 3 additions & 3 deletions cloudbuild.yaml
@@ -99,7 +99,7 @@ steps:
echo "pass" > /workspace/user_result.txt
# Make sure pods are running
- kubectl wait --all pods -n ml-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=300s
+ kubectl wait --all pods -n ml-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=1200s
kubectl port-forward -n ml-$SHORT_SHA-$_BUILD_ID service/ray-cluster-kuberay-head-svc 8265:8265 &
# Wait port-forwarding to take its place
sleep 5s
@@ -156,7 +156,7 @@ steps:
-auto-approve -no-color -lock=false
echo "pass" > /workspace/jupyterhub_tf_result.txt
- kubectl wait --all pods -n ml-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=300s
+ kubectl wait --all pods -n ml-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=1200s
kubectl get services -n ml-$SHORT_SHA-$_BUILD_ID
kubectl port-forward -n ml-$SHORT_SHA-$_BUILD_ID service/proxy-public 9443:80 &
# Wait port-forwarding to take its place
@@ -227,7 +227,7 @@ steps:
echo "pass" > /workspace/rag_tf_result.txt
# Validate Ray: Make sure pods are running
- kubectl wait --all pods -n rag-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=300s
+ kubectl wait --all pods -n rag-$SHORT_SHA-$_BUILD_ID --for=condition=Ready --timeout=1200s
kubectl port-forward -n rag-$SHORT_SHA-$_BUILD_ID service/ray-cluster-kuberay-head-svc 8265:8265 &
# Wait port-forwarding to take its place
sleep 5s
3 changes: 0 additions & 3 deletions modules/kuberay-cluster/main.tf
@@ -34,9 +34,6 @@ resource "helm_release" "ray-cluster" {
namespace = var.namespace
create_namespace = true
version = "1.0.0"
- # Timeout is increased to guarantee sufficient scale-up time for Autopilot nodes.
- timeout = 1200
- wait = true

values = [
templatefile("${path.module}/values.yaml", {