Fix: Always propagate pytorch task worker process exception timestamp to task exception #3057

fg91 · 2025-01-14T20:02:26Z

Why are the changes needed?

Since RFC flyteorg/flyte#5598, flytepropeller can collect error files from multiple worker pods of a distributed task in order to determine the root cause error by identifying the exception with the earliest timestamp. Before, only a single error file was used even if a task launched multiple pods that each could produce an error file.

Generally, the timestamp of the container error is measured here in the pod entrypoint.

In the case of the pytorch elastic plugin where each worker pod can launch multiple worker processes, torch elastic launch determines the worker process exception with the earliest timestamp. The earliest worker process timestamp determined by torch elastic launch is considered the timestamp with which the entire pytorch elastic task pod failed here.

However, we only propagate the timestamp of the worker process exception to the task exception in the case of recoverable errors. In the case of non-recoverable errors we don't propagate the timestamp and instead rely on a new timestamp being measured in the pod entrypoint (see above).

I observed that sometimes the timestamps measured in the pod entrypoint are roughly the same even though the timestamps in the worker processes in the different pods definitely differed. This suggests that in these cases torch distributed still manages to synchronize after one worker experienced an exception so that the pod entrypoints execute get_container_error_timestamp(...) roughly at the same time. The fix is obvious: propagate the worker process exception timestamp to the task exception also for non-recoverable errors.

Added unit tests.

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

#2797

Summary by Bito

Updates exception handling in PyTorch elastic task tests to use FlyteUserRuntimeException instead of RuntimeError. Improves worker process exception timestamp propagation for both recoverable and non-recoverable scenarios. Enhances test expectations and imports required exception classes.

Unit tests added: False

Estimated effort to review (1-5, lower is better): 1

… to task exception Signed-off-by: Fabio Grätz <[email protected]>

flyte-bot · 2025-01-14T20:02:40Z

Code Review Agent Run #3d1353

Actionable Suggestions - 3

plugins/flytekit-kf-pytorch/tests/test_elastic_task.py - 1
- Consider consolidating duplicate test functions · Line 281-312
plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py - 1
- Consider preserving original error message · Line 478-478
flytekit/exceptions/user.py - 1
- Consider adding timestamp parameter validation · Line 18-18

Review Details

Files reviewed - 3 · Commit Range: 4044c95..4044c95
- flytekit/exceptions/user.py
- plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py
- plugins/flytekit-kf-pytorch/tests/test_elastic_task.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by

flyte-bot · 2025-01-14T20:06:56Z

Changelist by Bito

This pull request implements the following key changes.

Key Change	Files Impacted
Bug Fix - Exception Timestamp Propagation Fix	- `user.py` - Added timestamp parameter to FlyteUserRuntimeException initialization - `task.py` - Modified exception handling to propagate worker process timestamp - `test_elastic_task.py` - Added tests for exception timestamp propagation and updated exception type

flyte-bot · 2025-01-14T20:06:58Z

plugins/flytekit-kf-pytorch/tests/test_elastic_task.py

+def test_exception_timestamp() -> None:
+    """Test that the timestamp of the worker process exception is propagated to the task exception."""
+    @task(
+        task_config=Elastic(
+            nnodes=1,
+            nproc_per_node=2,
+        )
+    )
+    def test_task():
+        raise Exception("Test exception")
+
+    with pytest.raises(Exception) as e:
+        test_task()
+
+    assert e.value.timestamp is not None
+
+
+def test_recoverable_exception_timestamp() -> None:
+    """Test that the timestamp of the worker process exception is propagated to the task exception."""
+    @task(
+        task_config=Elastic(
+            nnodes=1,
+            nproc_per_node=2,
+        )
+    )
+    def test_task():
+        raise FlyteRecoverableException("Recoverable test exception")
+
+    with pytest.raises(Exception) as e:
+        test_task()
+
+    assert e.value.timestamp is not None


Consider consolidating duplicate test functions

Consider consolidating the two test functions test_exception_timestamp() and test_recoverable_exception_timestamp() into a single parameterized test since they follow the same pattern and only differ in the exception type.

Code suggestion

Check the AI-generated fix before applying

Suggested change

def test_exception_timestamp() -> None:

"""Test that the timestamp of the worker process exception is propagated to the task exception."""

@task(

task_config=Elastic(

nnodes=1,

nproc_per_node=2,

)

)

def test_task():

raise Exception("Test exception")

with pytest.raises(Exception) as e:

test_task()

assert e.value.timestamp is not None

def test_recoverable_exception_timestamp() -> None:

"""Test that the timestamp of the worker process exception is propagated to the task exception."""

@task(

task_config=Elastic(

nnodes=1,

nproc_per_node=2,

)

)

def test_task():

raise FlyteRecoverableException("Recoverable test exception")

with pytest.raises(Exception) as e:

test_task()

assert e.value.timestamp is not None

def test_recoverable_exception_timestamp() -> None:

@pytest.mark.parametrize("exception_cls,message", [

(Exception, "Test exception"),

(FlyteRecoverableException, "Recoverable test exception")

])

def test_exception_timestamp(exception_cls, message) -> None:

"""Test that the timestamp of the worker process exception is propagated to the task exception."""

task_config=Elastic(

nnodes=1,

nproc_per_node=2,

)

)

def test_task():

raise exception_cls(message)

with pytest.raises(Exception) as e:

test_task()

assert e.value.timestamp is not None

Code Review Run #3d1353

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

flyte-bot · 2025-01-14T20:06:59Z

plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py

@@ -475,7 +475,7 @@ def fn_partial():
                # the automatically assigned timestamp based on exception creation time
                raise FlyteRecoverableException(e.format_msg(), timestamp=first_failure.timestamp)
            else:
-                raise RuntimeError(e.format_msg())
+                raise FlyteUserRuntimeException(e, timestamp=first_failure.timestamp)


Consider preserving original error message

Consider preserving the original error message from e.format_msg() when raising FlyteUserRuntimeException. The error message could provide valuable debugging context that is currently being lost.

Code suggestion

Check the AI-generated fix before applying

Suggested change

raise FlyteUserRuntimeException(e, timestamp=first_failure.timestamp)

raise FlyteUserRuntimeException(e.format_msg(), timestamp=first_failure.timestamp)

Code Review Run #3d1353

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

flyte-bot · 2025-01-14T20:07:00Z

flytekit/exceptions/user.py

@@ -15,14 +15,15 @@ class FlyteUserException(_FlyteException):
 class FlyteUserRuntimeException(_FlyteException):
    _ERROR_CODE = "USER:RuntimeError"

-    def __init__(self, exc_value: Exception):
+    def __init__(self, exc_value: Exception, timestamp: typing.Optional[float] = None):


Consider adding timestamp parameter validation

Consider adding validation for the timestamp parameter to ensure it's a valid timestamp value when provided.

Code suggestion

Check the AI-generated fix before applying

@@ -18,2 +18,5 @@ def __init__(self, exc_value: Exception, timestamp: typing.Optional[float] = None): + if timestamp is not None and (not isinstance(timestamp, (int, float)) or timestamp < 0): + raise ValueError("timestamp must be a non-negative number representing seconds since epoch") super().__init__(str(exc_value), timestamp=timestamp)

Code Review Run #3d1353

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

codecov · 2025-01-14T20:24:59Z

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 51.19%. Comparing base (34af2e2) to head (4044c95).
Report is 26 commits behind head on master.

Files with missing lines	Patch %	Lines
flytekit/exceptions/user.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #3057       +/-   ##
===========================================
- Coverage   82.79%   51.19%   -31.61%     
===========================================
  Files           3      202      +199     
  Lines         186    21297    +21111     
  Branches        0     2745     +2745     
===========================================
+ Hits          154    10902    +10748     
- Misses         32     9798     +9766     
- Partials        0      597      +597

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Fabio Grätz <[email protected]>

flyte-bot · 2025-01-14T21:33:37Z

Code Review Agent Run #3ffb7d

Actionable Suggestions - 0

Review Details

Files reviewed - 1 · Commit Range: 4044c95..c2ffc6d
- plugins/flytekit-kf-pytorch/tests/test_elastic_task.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by

thomasjpfan

LGTM

… to task exception (flyteorg#3057) * Fix: Always propagate pytorch task worker process exception timestamp to task exception Signed-off-by: Fabio Grätz <[email protected]> * Fix exist recoverable error test Signed-off-by: Fabio Grätz <[email protected]> --------- Signed-off-by: Fabio Grätz <[email protected]> Co-authored-by: Fabio Grätz <[email protected]>

… to task exception (#3057) * Fix: Always propagate pytorch task worker process exception timestamp to task exception Signed-off-by: Fabio Grätz <[email protected]> * Fix exist recoverable error test Signed-off-by: Fabio Grätz <[email protected]> --------- Signed-off-by: Fabio Grätz <[email protected]> Co-authored-by: Fabio Grätz <[email protected]>

* Make FlyteUserRuntimeException to return error_code in Container Error (#3059) * Make FlyteUserRuntimeException to return error_code in the ContainerError Signed-off-by: Rafael Ribeiro Raposo <[email protected]> * [Flytekit] Separate remote signal functions (#2933) * feat: separate remote signal functions Signed-off-by: mao3267 <[email protected]> * refactor: make lint Signed-off-by: mao3267 <[email protected]> * test: add integration test for separated signal functions Signed-off-by: mao3267 <[email protected]> * fix: register workflow to admin Signed-off-by: mao3267 <[email protected]> * fix: integration test and approve function Signed-off-by: mao3267 <[email protected]> * fix: remove approve node output Signed-off-by: mao3267 <[email protected]> * fix: replace single sleep command to retry statement Signed-off-by: mao3267 <[email protected]> * fix: update comments Signed-off-by: mao3267 <[email protected]> * fix: simplify duplicate retry operations Signed-off-by: mao3267 <[email protected]> --------- Signed-off-by: mao3267 <[email protected]> * Only copy over cat-certificates.crt if it does not exist in base image (#3067) * Do not copy over ca-certifcates.crt if the base image has one Signed-off-by: Thomas J. Fan <[email protected]> * Only copy over cat-certificates.crt if it does not exist in base image Signed-off-by: Thomas J. Fan <[email protected]> --------- Signed-off-by: Thomas J. Fan <[email protected]> * Support with_overrides setting metadata for map_task subnode instead of parent node (#2982) * test Signed-off-by: Paul Dittamo <[email protected]> * add support for with_overrides for map tasks Signed-off-by: Paul Dittamo <[email protected]> * expand unit test Signed-off-by: Paul Dittamo <[email protected]> * cleanup Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> * fix: remove duplication log when execute (#3052) Signed-off-by: Vincent <[email protected]> * Fix: Always propagate pytorch task worker process exception timestamp to task exception (#3057) * Fix: Always propagate pytorch task worker process exception timestamp to task exception Signed-off-by: Fabio Grätz <[email protected]> * Fix exist recoverable error test Signed-off-by: Fabio Grätz <[email protected]> --------- Signed-off-by: Fabio Grätz <[email protected]> Co-authored-by: Fabio Grätz <[email protected]> * Allow user-defined dataclass type transformer (again) (#3075) * Allow for user-defined dataclass type tranformers Signed-off-by: Eduardo Apolinario <[email protected]> * Finish comment and remote user-defined dataclass transformer from registry Signed-off-by: Eduardo Apolinario <[email protected]> --------- Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]> --------- Signed-off-by: Rafael Ribeiro Raposo <[email protected]> Signed-off-by: mao3267 <[email protected]> Signed-off-by: Thomas J. Fan <[email protected]> Signed-off-by: Paul Dittamo <[email protected]> Signed-off-by: Vincent <[email protected]> Signed-off-by: Fabio Grätz <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Rafael Raposo <[email protected]> Co-authored-by: Vincent Chen <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Paul Dittamo <[email protected]> Co-authored-by: V <[email protected]> Co-authored-by: Fabio M. Graetz, Ph.D. <[email protected]> Co-authored-by: Fabio Grätz <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]>

… to task exception (flyteorg#3057) * Fix: Always propagate pytorch task worker process exception timestamp to task exception Signed-off-by: Fabio Grätz <[email protected]> * Fix exist recoverable error test Signed-off-by: Fabio Grätz <[email protected]> --------- Signed-off-by: Fabio Grätz <[email protected]> Co-authored-by: Fabio Grätz <[email protected]> Signed-off-by: lu00122 <[email protected]>

Fix: Always propagate pytorch task worker process exception timestamp…

4044c95

… to task exception Signed-off-by: Fabio Grätz <[email protected]>

fg91 requested review from wild-endeavor, kumare3, eapolinario, pingsutw, cosmicBboy, samhita-alla, thomasjpfan and Future-Outlier as code owners January 14, 2025 20:02

flyte-bot reviewed Jan 14, 2025

View reviewed changes

Fix exist recoverable error test

c2ffc6d

Signed-off-by: Fabio Grätz <[email protected]>

thomasjpfan approved these changes Jan 18, 2025

View reviewed changes

fg91 merged commit 3260ddf into master Jan 18, 2025
104 checks passed

fg91 deleted the fg91/fix/pytorch-task-exception-timestamp branch January 18, 2025 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Always propagate pytorch task worker process exception timestamp to task exception #3057

Fix: Always propagate pytorch task worker process exception timestamp to task exception #3057

fg91 commented Jan 14, 2025 •

edited by flyte-bot

Loading

flyte-bot commented Jan 14, 2025 •

edited

Loading

Code Review Agent Run #3d1353

flyte-bot commented Jan 14, 2025 •

edited

Loading

Changelist by Bito

flyte-bot Jan 14, 2025

flyte-bot Jan 14, 2025

flyte-bot Jan 14, 2025

codecov bot commented Jan 14, 2025

flyte-bot commented Jan 14, 2025 •

edited

Loading

Code Review Agent Run #3ffb7d

thomasjpfan left a comment

	raise FlyteUserRuntimeException(e, timestamp=first_failure.timestamp)
	raise FlyteUserRuntimeException(e.format_msg(), timestamp=first_failure.timestamp)

Fix: Always propagate pytorch task worker process exception timestamp to task exception #3057

Fix: Always propagate pytorch task worker process exception timestamp to task exception #3057

Conversation

fg91 commented Jan 14, 2025 • edited by flyte-bot Loading

Why are the changes needed?

Related PRs

Summary by Bito

flyte-bot commented Jan 14, 2025 • edited Loading

Code Review Agent Run #3d1353

flyte-bot commented Jan 14, 2025 • edited Loading

Changelist by Bito

flyte-bot Jan 14, 2025

Choose a reason for hiding this comment

flyte-bot Jan 14, 2025

Choose a reason for hiding this comment

flyte-bot Jan 14, 2025

Choose a reason for hiding this comment

codecov bot commented Jan 14, 2025

Codecov Report

flyte-bot commented Jan 14, 2025 • edited Loading

Code Review Agent Run #3ffb7d

thomasjpfan left a comment

Choose a reason for hiding this comment

fg91 commented Jan 14, 2025 •

edited by flyte-bot

Loading

flyte-bot commented Jan 14, 2025 •

edited

Loading

flyte-bot commented Jan 14, 2025 •

edited

Loading

flyte-bot commented Jan 14, 2025 •

edited

Loading