[k8s jobsets] add startup policy. (Netflix#2063)
valayDave authored Sep 26, 2024
1 parent e2ef5d0 commit d4a0ec7
Showing 2 changed files with 12 additions and 2 deletions.
6 changes: 5 additions & 1 deletion metaflow/plugins/kubernetes/kubernetes_decorator.py
@@ -543,7 +543,11 @@ def _save_package_once(cls, flow_datastore, package):
 
 # TODO: Unify this method with the multi-node setup in @batch
 def _setup_multinode_environment():
-    # FIXME: what about MF_MASTER_PORT
+    # TODO [FIXME SOON]
+    # Even if Kubernetes may deploy control pods before worker pods, there is always a
+    # possibility that the worker pods may start before the control. In the case that this happens,
+    # the worker pods will not be able to resolve the control pod's IP address and this will cause
+    # the worker pods to fail. This function should account for this in the near future.
     import socket
 
     try:
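The TODO in this hunk describes a race: a worker pod that starts before the control pod cannot resolve the control pod's address and fails. One way a worker might tolerate that race is to retry name resolution until a deadline passes. The following is only a sketch under that assumption; `wait_for_control_ip`, its parameters, and the example hostname are hypothetical and are not part of Metaflow or of this commit.

```python
import socket
import time


def wait_for_control_ip(
    hostname,
    timeout=60.0,
    interval=2.0,
    resolve=socket.gethostbyname,  # injectable so the retry loop can be tested
    sleep=time.sleep,
):
    """Retry DNS resolution of the control pod until it succeeds or times out.

    Hypothetical helper: re-raises the last socket.gaierror once `timeout`
    seconds have elapsed without a successful lookup.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return resolve(hostname)
        except socket.gaierror:
            # The control pod may not be registered in DNS yet.
            if time.monotonic() >= deadline:
                raise
            sleep(interval)


if __name__ == "__main__":
    # "localhost" stands in for the control pod's hostname in this sketch.
    print(wait_for_control_ip("localhost", timeout=5.0, interval=0.5))
```

A retry like this would complement, not replace, ordering guarantees made at the JobSet level, since it only masks the race from inside the worker.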
8 changes: 7 additions & 1 deletion metaflow/plugins/kubernetes/kubernetes_jobsets.py
@@ -866,7 +866,13 @@ def dump(self):
 spec=dict(
     replicatedJobs=[self.control.dump(), self.worker.dump()],
     suspend=False,
-    startupPolicy=None,
+    startupPolicy=dict(
+        # We explicitly set an InOrder Startup policy so that
+        # we can ensure that the control pod starts before the worker pods.
+        # This is required so that when worker pods try to access the control's IP
+        # we are able to resolve the control's IP address.
+        startupPolicyOrder="InOrder"
+    ),
     successPolicy=None,
     # The Failure Policy helps setting the number of retries for the jobset.
     # but we don't rely on it and instead rely on either the local scheduler
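The change in this hunk relies on the order of `replicatedJobs`: with `startupPolicyOrder="InOrder"`, the control job is listed first so that it starts before the workers. A minimal sketch of the resulting spec fragment follows; `build_jobset_spec` and the placeholder job dictionaries are hypothetical and only illustrate the shape of the fields touched by this commit, not the full output of `dump()`.

```python
def build_jobset_spec(control_job, worker_job, in_order=True):
    """Build the part of a JobSet spec that this commit changes.

    Hypothetical helper: the control job is listed first, so an "InOrder"
    startup policy brings it up before the worker job.
    """
    return dict(
        replicatedJobs=[control_job, worker_job],
        suspend=False,
        startupPolicy=dict(
            startupPolicyOrder="InOrder" if in_order else "AnyOrder"
        ),
    )


if __name__ == "__main__":
    # Placeholder job definitions; real ones carry a full Job template.
    spec = build_jobset_spec({"name": "control"}, {"name": "worker"})
    print(spec["startupPolicy"])
```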
