Fix: Distributed Training Rendezvous error with MCAD v.1.34.1 (#793)
* fix: distributed rendezvous error with MCAD v.1.34.1

* fix: Update tests
Sara-KS committed Nov 20, 2023
1 parent b24f92e commit c3868c1
Showing 2 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion torchx/schedulers/kubernetes_mcad_scheduler.py
@@ -436,7 +436,7 @@ def mcad_svc(
             target_port=int(service_port),
         )
     ],
-    selector={"appwrapper.workload.codeflare.dev": svc_name},
+    selector={LABEL_UNIQUE_NAME: svc_name},
     session_affinity="None",
     type="ClusterIP",
 ),
4 changes: 2 additions & 2 deletions torchx/schedulers/test/kubernetes_mcad_scheduler_test.py
@@ -450,7 +450,7 @@ def test_create_mcad_service(self) -> None:
             target_port=int(service_port),
         )
     ],
-    selector={"appwrapper.workload.codeflare.dev": service_name},
+    selector={"app.kubernetes.io/instance": service_name},
     session_affinity="None",
     type="ClusterIP",
 ),
@@ -667,7 +667,7 @@ def test_submit_dryrun(self) -> None:
       targetPort: 1234
     publishNotReadyAddresses: true
     selector:
-      appwrapper.workload.codeflare.dev: app-name
+      app.kubernetes.io/instance: app-name
     sessionAffinity: None
     type: ClusterIP
   status:
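
Why the selector key matters can be sketched in a few lines of plain Python. This is an illustration only, not code from the commit: it models the equality-based matching a Kubernetes Service applies to pod labels. The value of `LABEL_UNIQUE_NAME` is inferred from the updated test expectation above, `selector_matches` is a hypothetical helper, and the pod labels are an assumption (a pod that carries `app.kubernetes.io/instance` but not the old `appwrapper.workload.codeflare.dev` label).

```python
from typing import Dict

# Inferred from the updated test expectation in this commit.
LABEL_UNIQUE_NAME = "app.kubernetes.io/instance"


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Hypothetical helper: every selector key/value pair must appear in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Assumed pod labels, for illustration only.
pod_labels = {LABEL_UNIQUE_NAME: "app-name"}

old_selector = {"appwrapper.workload.codeflare.dev": "app-name"}
new_selector = {LABEL_UNIQUE_NAME: "app-name"}

print(selector_matches(old_selector, pod_labels))  # False
print(selector_matches(new_selector, pod_labels))  # True
```

Under that assumption, a Service built with the old selector would select no pods, so it would have no endpoints for the workers to rendezvous through, while the new selector would match.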
