Add a new experimental restart policy for large scale model training
Differential Revision: D58684341

Pull Request resolved: #922
manav-a committed Jun 18, 2024
1 parent 7bfa26d commit cb1fec1
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion torchx/specs/api.py
@@ -237,11 +237,15 @@ class RetryPolicy(str, Enum):
application to deal with failed replica departures and
replacement replica admittance.
2. APPLICATION: Restarts the entire application.
3. HOT_SPARE: Restarts the replicas for a role as long as the quorum (min_replicas)
   is not violated, using extra hosts as spares. It does not truly support
   elasticity; it simply treats the delta between num_replicas and min_replicas
   as spares (EXPERIMENTAL).
"""

REPLICA = "REPLICA"
APPLICATION = "APPLICATION"
HOT_SPARE = "HOT_SPARE"


class MountType(str, Enum):
@@ -340,6 +344,8 @@ class Role:
and num_replicas depending on the cluster resources and
policies. If the scheduler doesn't support auto scaling, this
field is ignored and the job size will be num_replicas.
EXPERIMENTAL: For the HOT_SPARE restart policy, this field
indicates the quorum required for the job to run.
max_retries: max number of retries before giving up
retry_policy: retry behavior upon replica failures
resource: Resource requirement for the role. The role should be scheduled
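
For context, here is a minimal sketch (not part of this commit) of how a Role might opt into the experimental HOT_SPARE policy. The image, entrypoint, args, retry count, and resource values below are hypothetical; only num_replicas, min_replicas, retry_policy, and RetryPolicy.HOT_SPARE come from the API shown in the diff.

from torchx import specs

# Hypothetical training role: 16 replicas requested, quorum of 12.
# With RetryPolicy.HOT_SPARE, the 4 extra hosts (num_replicas - min_replicas)
# act as spares, and the job keeps running as long as min_replicas replicas
# stay healthy.
trainer = specs.Role(
    name="trainer",
    image="my_training_image:latest",          # hypothetical image
    entrypoint="python",
    args=["-m", "train"],                      # hypothetical entrypoint module
    num_replicas=16,
    min_replicas=12,                           # quorum required for the job to run
    max_retries=3,
    retry_policy=specs.RetryPolicy.HOT_SPARE,
    resource=specs.Resource(cpu=8, gpu=8, memMB=64 * 1024),
)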
