
Doc: Document how to enable distributed error aggregation according to RFC #5598 for pytorch distributed tasks #1776

Open · wants to merge 1 commit into base: master
Conversation

@fg91 fg91 commented Jan 14, 2025

flyteorg/flyte#6103 made the distributed error aggregation behavior proposed in flyteorg/flyte#5598 opt-in.

In this PR, I document on the pytorch task docs page that this feature exists and how to enable it; a sketch of the opt-in is shown below.

See rendered docs here.
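A minimal sketch of what enabling this could look like, assuming the opt-in is exposed as an environment variable set on the task. The variable name `FLYTE_ENABLE_DISTRIBUTED_ERROR_AGGREGATION` below is a placeholder assumption, not taken from this PR; the rendered docs describe the actual switch, and the worker count is illustrative:

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch, Worker


@task(
    task_config=PyTorch(worker=Worker(replicas=2)),
    # Hypothetical opt-in (placeholder variable name): per flyteorg/flyte#6103,
    # distributed error aggregation is opt-in for task_config=PyTorch tasks.
    environment={"FLYTE_ENABLE_DISTRIBUTED_ERROR_AGGREGATION": "true"},
)
def train() -> None:
    ...
```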

Doc: Document how to enable distributed error aggregation according to RFC #5598 for pytorch distributed tasks

Signed-off-by: Fabio Grätz <[email protected]>
@@ -350,6 +350,14 @@ def pytorch_training_wf(
# To visualize the outcomes, you can point Tensorboard on your local machine to these storage locations.
# :::
#
# :::{note}
@fg91 (Member, Author) commented on the diff:

Bit of scope creep:
I'm moving this section up, before the "pytorch elastic" section, as it affects only task_config=PyTorch tasks. Tasks using task_config=Elastic do this by default; see the sketch below for contrast.
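For contrast, a minimal sketch of an Elastic task, which, per the comment above, aggregates worker errors by default with no opt-in needed (node and process counts are illustrative):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# Elastic tasks get distributed error aggregation by default,
# so no extra configuration is needed here.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4))
def train_elastic() -> None:
    ...
```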

@fg91 fg91 requested a review from eapolinario January 14, 2025 21:18
@fg91 fg91 self-assigned this Jan 14, 2025
@fg91 fg91 marked this pull request as ready for review January 14, 2025 21:29