Add a new experimental restart policy for large scale model training (#922)

Summary:
Pull Request resolved: #922

TSIA (title says it all)

Reviewed By: andywag

Differential Revision: D58684341
manav-a authored and facebook-github-bot committed Jun 18, 2024
1 parent 5058b6b commit 5207938
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion torchx/specs/api.py
@@ -237,11 +237,13 @@ class RetryPolicy(str, Enum):
application to deal with failed replica departures and
replacement replica admittance.
2. APPLICATION: Restarts the entire application.
3. HOT_SPARE: Restarts the replicas for a role as long as quorum (min_replicas)
is not violated using extra hosts as spares. (EXPERIMENTAL)
"""

REPLICA = "REPLICA"
APPLICATION = "APPLICATION"
HOT_SPARE = "HOT_SPARE"


class MountType(str, Enum):
@@ -340,6 +342,8 @@ class Role:
and num_replicas depending on the cluster resources and
policies. If the scheduler doesn't support auto scaling this
field is ignored and the job size will be num_replicas.
EXPERIMENTAL: For HOT_SPARE restart policy this field is used to
indicate the quorum required for the job to run.
max_retries: max number of retries before giving up
retry_policy: retry behavior upon replica failures
resource: Resource requirement for the role. The role should be scheduled
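
Note on the two hunks above: the following is a minimal sketch of how the `HOT_SPARE` docstring and the `min_replicas` note could be read together. It does not import torchx. `RetryPolicy` is re-declared locally to mirror the enum in the diff, while `RoleSketch` and `quorum_held` are hypothetical names introduced only for illustration. The fallback to `num_replicas` when `min_replicas` is unset is an assumption, not something this commit states, and how a scheduler actually swaps in spare hosts is outside this diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RetryPolicy(str, Enum):
    """Local mirror of the enum in the diff, so the sketch runs without torchx."""

    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    HOT_SPARE = "HOT_SPARE"  # marked EXPERIMENTAL in the commit


@dataclass
class RoleSketch:
    """Hypothetical stand-in holding only the Role fields discussed here."""

    num_replicas: int
    retry_policy: RetryPolicy
    min_replicas: Optional[int] = None


def quorum_held(role: RoleSketch, healthy_replicas: int) -> bool:
    """Return True if the quorum (min_replicas) is still satisfied.

    Assumption: when min_replicas is unset, the quorum is num_replicas.
    """
    quorum = role.min_replicas if role.min_replicas is not None else role.num_replicas
    return healthy_replicas >= quorum


if __name__ == "__main__":
    # 10 replicas requested, 8 required for the job to keep running.
    role = RoleSketch(num_replicas=10, retry_policy=RetryPolicy.HOT_SPARE, min_replicas=8)
    print(quorum_held(role, 9))  # one replica lost, quorum still met
    print(quorum_held(role, 7))  # three replicas lost, quorum violated
```

Because the enum mixes in `str`, `RetryPolicy("HOT_SPARE")` resolves to `RetryPolicy.HOT_SPARE` and the member compares equal to the plain string `"HOT_SPARE"`, so the new value can be parsed from string-based configuration the same way as the existing members.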