[ROCm] Implement RNN support #25755

Merged 1 commit into jax-ml:main on Jan 15, 2025

Conversation

@Ruturaj4 (Collaborator) commented Jan 7, 2025

Created from: ROCm#171

@Ruturaj4 (Collaborator, Author) commented Jan 7, 2025

@dfm and @superbobry could you please take a look?

github-actions bot force-pushed the ci_rnn_final-upstream branch from 0b07837 to 36d037e on January 7, 2025 at 19:08
@superbobry (Collaborator) left a comment:

@dfm want to have a look as well?

tests/experimental_rnn_test.py: two inline review comments (outdated, resolved)
google-ml-butler bot added the kokoro:force-run and pull ready (Ready for copybara import and testing) labels on Jan 8, 2025
@dfm (Collaborator) left a comment:

This looks good overall - thanks! My main high-level comment is that it would be useful to move as much of the #ifdef JAX_GPU_HIP logic as possible into vendor.h rather than keeping it in rnn_kernels.cc directly. It's OK to have some, but the more we can move, the better. Can you look into redefining some of the macros in vendor.h to consolidate the logic there?

Comment on lines 465 to 466
mlir.register_lowering(rnn_fwd_p, gpu_rnn.cudnn_rnn_fwd_lowering, platform='cuda')
mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')

Since gpu_rnn is in jaxlib, these changes will cause problems with version skew. JAX always needs to work with the most recent stable release of jaxlib. Perhaps you could protect this using hasattr(gpu_rnn, "miopen_rnn_fwd_lowering")?
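
A minimal sketch of what that guard could look like (an editor's illustration, not code from this PR; the import paths and the rnn_fwd_p primitive are assumed from the diff quoted above):

# Sketch only: the module paths below are assumptions; the hasattr check itself
# is the guard suggested in the review comment.
from jax.interpreters import mlir   # public lowering-registration API
from jax._src.lib import gpu_rnn    # jaxlib-provided module (assumed path)

# rnn_fwd_p is the RNN forward primitive already defined earlier in the module.
mlir.register_lowering(rnn_fwd_p, gpu_rnn.cudnn_rnn_fwd_lowering, platform='cuda')

# Older released jaxlib wheels do not export the MIOpen lowering, so register
# it only when the installed jaxlib actually provides the symbol.
if hasattr(gpu_rnn, 'miopen_rnn_fwd_lowering'):
  mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')

The backward-lowering registration discussed below (lines 510 to 511 of the diff) would need the same kind of guard.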

Comment on lines 510 to 511
mlir.register_lowering(
rnn_bwd_p, gpu_rnn.miopen_rnn_bwd_lowering, platform='rocm')

Similarly, this needs to be protected against old versions of jaxlib.

Ruturaj4 force-pushed the ci_rnn_final-upstream branch 4 times, most recently from 2e86003 to 18cc2d2 on January 13, 2025 at 23:28
@Ruturaj4 (Collaborator, Author) commented Jan 13, 2025

@dfm I'm still not sure why this error won't go away; I have the protections in place. Could it be due to how you test this in your internal CI?

Seems like you are getting the jaxlib from upstream and that is why the related tests fail?

[two attached screenshots of the failing tests]

@Ruturaj4 (Collaborator, Author) commented:

> This looks good overall - thanks! My main high-level comment is that it would be useful to move as much of the #ifdef JAX_GPU_HIP logic as possible into vendor.h rather than keeping it in rnn_kernels.cc directly. It's OK to have some, but the more we can move, the better. Can you look into redefining some of the macros in vendor.h to consolidate the logic there?

@dfm thanks. I see what you mean. However, the MIOpen APIs are quite different from the cuDNN ones. For example:

#ifdef JAX_GPU_HIP
  JAX_RETURN_IF_ERROR(JAX_AS_STATUS(gpudnnSetDropoutDescriptor(
      dropout_desc, handle.get(), d.dropout, dropout_states_dev, state_size, 123, false, false,
      MIOPEN_RNG_PSEUDO_XORWOW)));
#else // JAX_GPU_CUDA
  JAX_RETURN_IF_ERROR(JAX_AS_STATUS(gpudnnSetDropoutDescriptor(
      dropout_desc, handle.get(), d.dropout, nullptr, state_size, 123)));
#endif // JAX_GPU_HIP

I checked to see how many of the JAX_GPU_HIP blocks I could move, but it seems very difficult to consolidate much given the significant differences between the APIs. What do you think?

@dfm (Collaborator) commented Jan 14, 2025

> Seems like you are getting the jaxlib from upstream and that is why the related tests fail?

Yes! We require that jax (the Python package) always be compatible with the currently released jaxlib. You'll probably need some sort of version guard, or you can protect the jax/jaxlib boundary using something like hasattr(gpu_rnn, "...").
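
For the version-guard alternative mentioned here, one hedged sketch (an editor's illustration; the cutoff version is a placeholder, and mlir, gpu_rnn, rnn_fwd_p, and rnn_bwd_p are as in the sketch above):

# Hypothetical version guard; the minimum version below is illustrative only.
import importlib.metadata

def _jaxlib_at_least(*minimum: int) -> bool:
  # Compare the installed jaxlib version against a (major, minor, patch) minimum.
  installed = importlib.metadata.version("jaxlib").split(".")[:3]
  return tuple(int(p) for p in installed) >= minimum

# Only register the ROCm lowerings when the installed jaxlib is new enough to
# ship miopen_rnn_fwd_lowering / miopen_rnn_bwd_lowering.
if _jaxlib_at_least(0, 5, 0):  # placeholder cutoff, not the real one for this change
  mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')
  mlir.register_lowering(rnn_bwd_p, gpu_rnn.miopen_rnn_bwd_lowering, platform='rocm')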

Also: It looks like this has introduced some build issues for the CUDA CI. Can you take a look at those too?

Ruturaj4 force-pushed the ci_rnn_final-upstream branch 3 times, most recently from a909942 to dfd1a65 on January 15, 2025 at 00:30
Ruturaj4 force-pushed the ci_rnn_final-upstream branch from dfd1a65 to fe68eb8 on January 15, 2025 at 01:04
@Ruturaj4 (Collaborator, Author) commented:

@dfm I just fixed the patch. Could you please approve? Thanks!

dfm self-assigned this on Jan 15, 2025
copybara-service bot merged commit 41993fd into jax-ml:main on Jan 15, 2025 (23 of 24 checks passed)
Labels: pull ready (Ready for copybara import and testing)
Participants: 4