[FIX] ctc_loss_op_test_gpu unit test suite failure #3083

AleksaArsic · 2025-08-12T15:41:07Z

Motivation

Running bazel test -s --config=rocm //tensorflow/python/kernel_tests/nn_ops:ctc_loss_op_test_gpu fails CTCLossDeterministicTest.testForwardAndBackward on gfx11xx and gfx12xx. The failure occurs because one element of the difference between gradients exceeds the previous absolute tolerance.

Log:

======================================================================
FAIL: testForwardAndBackward2 (True, False) (__main__.CTCLossDeterministicTest) [graph_mode]
CTCLossDeterministicTest.testForwardAndBackward2 (True, False)
testForwardAndBackward(True, False)
----------------------------------------------------------------------
...
Mismatched elements: 1 / 32000 (0.00313%)
Max absolute difference among violations: 7.075071e-05
Max relative difference among violations: 0.00036804
...
======================================================================
FAIL: testForwardAndBackward3 (True, True) (__main__.CTCLossDeterministicTest) [graph_mode]
CTCLossDeterministicTest.testForwardAndBackward3 (True, True)
testForwardAndBackward(True, True)
----------------------------------------------------------------------
...
Mismatched elements: 1 / 32000 (0.00313%)
Max absolute difference among violations: 5.2332878e-05
Max relative difference among violations: 0.00028655
...
----------------------------------------------------------------------

Technical Details

Increase the absolute tolerance used when comparing gradients in CTCLossDeterministicTest.testForwardAndBackward to account for differences caused by floating-point arithmetic.
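To illustrate why a single element can trip the check, here is a minimal sketch of the usual closeness criterion, |actual - expected| <= atol + rtol * |expected|, which is the rule numpy.testing.assert_allclose applies. The difference magnitudes come from the log above; the tolerance values (RTOL, OLD_ATOL, NEW_ATOL) are hypothetical and are not the values used in ctc_loss_op_test.py.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Return True if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Magnitudes reported in the log for the single mismatched element.
max_abs_diff = 7.075071e-05
max_rel_diff = 0.00036804

# The relative difference is abs_diff / |expected|, so the expected
# gradient value has a magnitude of roughly 0.19.
expected = max_abs_diff / max_rel_diff
actual = expected + max_abs_diff

# Hypothetical tolerances, chosen only to illustrate the effect.
RTOL = 1e-6
OLD_ATOL = 5e-5
NEW_ATOL = 1e-4

print(within_tolerance(actual, expected, RTOL, OLD_ATOL))  # False
print(within_tolerance(actual, expected, RTOL, NEW_ATOL))  # True
```

With a small rtol the bound is dominated by atol, so raising atol above the observed absolute difference is enough to make the comparison pass.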

tensorflow/python/kernel_tests/nn_ops/ctc_loss_op_test.py

…sticTest.testForwardAndBackward due to the floating-point arithmetic on gfx11xx and gfx12xx. Implement test_util.gpu_gcn_arch(), which returns the GCN architecture of the current GPU. Add a unit test for test_util.gpu_gcn_arch(). Modify GetShortDeviceDescription() in gpu_device.cc to return the GCN arch as well.
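One way a test could relax the tolerance only on the affected architectures is sketched below. This is an assumption about the shape of such a check, not the code in the PR: the function name gradient_atol, the constants, and both tolerance values are hypothetical, and the architecture string is assumed to look like "gfx1100" or "gfx1201".

```python
# Hypothetical tolerance values, for illustration only.
DEFAULT_ATOL = 1e-5
RELAXED_ATOL = 1e-4

# gfx11xx and gfx12xx share these prefixes.
RELAXED_ARCH_PREFIXES = ("gfx11", "gfx12")


def gradient_atol(gcn_arch: str) -> float:
    """Pick an absolute tolerance based on the GCN architecture name.

    An empty string (for example on a non-ROCm platform) falls back to
    the default tolerance.
    """
    if gcn_arch.startswith(RELAXED_ARCH_PREFIXES):
        return RELAXED_ATOL
    return DEFAULT_ATOL


print(gradient_atol("gfx1100"))  # relaxed tolerance
print(gradient_atol("gfx90a"))   # default tolerance
print(gradient_atol(""))         # default tolerance
```

Keeping the default tolerance everywhere else means the test stays as strict as before on architectures that did not show the mismatch.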

ScXfjiang · 2025-08-20T08:59:25Z

tensorflow/core/common_runtime/gpu/gpu_device.cc

@@ -1970,6 +1970,7 @@ static string GetShortDeviceDescription(
 #elif TENSORFLOW_USE_ROCM
  return strings::StrCat("device: ", platform_device_id.value(),
                         ", name: ", desc.name(),
+                         ", gcn arch: ", desc.rocm_compute_capability().gcn_arch_name(),


Would it be better to use the same key-value pair as CUDA, even though they return the same thing?

compute capability: desc.rocm_compute_capability().ToString()

Here we return only one specific part of the compute capability, so I used the getter that returns gcn_arch_name.
Also, if we inspect device_description.h, we can see that CudaComputeCapability has a ToString() method that returns the version in the format "major.minor", while RocmComputeCapability has no equivalent method. Even if it did, it would not fit our needs. :)

ScXfjiang · 2025-08-20T08:59:30Z

tensorflow/python/framework/test_util.py

@@ -189,6 +189,29 @@ def gpu_device_name() -> str:
      return compat.as_str(x.name)
  return ""

+@tf_export("test.gpu_gcn_arch")


Does CUDA have the counterpart? What's the behavior if this API is called on the CUDA platform?

Should we add the rocm_ prefix here?

If this method is called on a system with CUDA support, it simply returns an empty string, because the CUDA device description is not the same as the ROCm one.

We could be more descriptive with the name and add a detailed design comment.
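The behavior discussed in this thread (an architecture name on ROCm, an empty string on CUDA) can be sketched as a small parser over the short device description. This is not the implementation of test_util.gpu_gcn_arch(); the helper name parse_gcn_arch and both sample strings are hypothetical, and the only detail taken from the diff is the ", gcn arch: " key.

```python
import re

# Matches the "gcn arch: <name>" field up to the next comma.
_GCN_ARCH_RE = re.compile(r"gcn arch:\s*([^,]+)")


def parse_gcn_arch(description: str) -> str:
    """Return the GCN arch name from a device description, or "" if absent."""
    match = _GCN_ARCH_RE.search(description)
    return match.group(1).strip() if match else ""


# Hypothetical description strings, for illustration only.
rocm_desc = "device: 0, name: Example AMD GPU, gcn arch: gfx1100, pci bus id: 0000:03:00.0"
cuda_desc = "device: 0, name: Example NVIDIA GPU, pci bus id: 0000:01:00.0, compute capability: 8.0"

print(parse_gcn_arch(rocm_desc))        # gfx1100
print(repr(parse_gcn_arch(cuda_desc)))  # ''
```

Returning an empty string rather than raising lets callers on CUDA treat "no GCN arch" as an ordinary case.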

i-chaochen

LGTM

AleksaArsic marked this pull request as ready for review August 12, 2025 15:46

AleksaArsic requested a review from i-chaochen August 12, 2025 15:46

i-chaochen reviewed Aug 13, 2025

ScXfjiang approved these changes Aug 13, 2025

AleksaArsic force-pushed the fix-ctc-loss-op-test-fail branch 3 times, most recently from 0d3905b to c8301b6 Compare August 19, 2025 20:17

AleksaArsic requested review from i-chaochen and ScXfjiang August 19, 2025 20:17

AleksaArsic force-pushed the fix-ctc-loss-op-test-fail branch from c8301b6 to 1fce875 Compare August 20, 2025 08:01

AleksaArsic force-pushed the fix-ctc-loss-op-test-fail branch from 1fce875 to 574fddb Compare August 20, 2025 08:02

ScXfjiang reviewed Aug 20, 2025

i-chaochen approved these changes Sep 8, 2025