[ROCm] Implement RNN support #25755

Merged 1 commit into jax-ml:main on Jan 15, 2025

Conversation

@Ruturaj4 (Collaborator) commented Jan 7, 2025

Created from: ROCm#171

@Ruturaj4 (Collaborator, Author) commented Jan 7, 2025

@dfm and @superbobry could you please take a look?

github-actions bot force-pushed the ci_rnn_final-upstream branch from 0b07837 to 36d037e on January 7, 2025 at 19:08
@superbobry (Collaborator) left a comment:

@dfm want to have a look as well?

tests/experimental_rnn_test.py: two inline review comments (outdated, resolved)
google-ml-butler bot added the kokoro:force-run and pull ready (Ready for copybara import and testing) labels on Jan 8, 2025
@dfm (Collaborator) left a comment:

This looks good overall - thanks! My main high-level comment is that it would be useful to move as much of the #ifdef JAX_GPU_HIP logic as possible into vendor.h rather than keeping it in rnn_kernels.cc directly. It's OK to have some, but the more we can move, the better. Can you look into redefining some of the macros in vendor.h to consolidate the logic there?

Comment on lines 465 to 466
mlir.register_lowering(rnn_fwd_p, gpu_rnn.cudnn_rnn_fwd_lowering, platform='cuda')
mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')

Since gpu_rnn is in jaxlib, these changes will cause problems with version skew. JAX always needs to work with the most recent stable release of jaxlib. Perhaps you could protect this using hasattr(gpu_rnn, "miopen_rnn_fwd_lowering")?
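
A minimal sketch of what that guard could look like (an editor's illustration, not code from this PR; the import paths and the rnn_fwd_p primitive are assumed from the diff quoted above):

# Sketch only: the module paths below are assumptions; the hasattr check itself
# is the guard suggested in the review comment.
from jax.interpreters import mlir   # public lowering-registration API
from jax._src.lib import gpu_rnn    # jaxlib-provided module (assumed path)

# rnn_fwd_p is the RNN forward primitive already defined earlier in the module.
mlir.register_lowering(rnn_fwd_p, gpu_rnn.cudnn_rnn_fwd_lowering, platform='cuda')

# Older released jaxlib wheels do not export the MIOpen lowering, so register
# it only when the installed jaxlib actually provides the symbol.
if hasattr(gpu_rnn, 'miopen_rnn_fwd_lowering'):
  mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')

The backward-lowering registration discussed below (lines 510 to 511 of the diff) would need the same kind of guard.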

Comment on lines 510 to 511
mlir.register_lowering(
rnn_bwd_p, gpu_rnn.miopen_rnn_bwd_lowering, platform='rocm')

Similarly, this needs to be protected against old versions of jaxlib.

Ruturaj4 force-pushed the ci_rnn_final-upstream branch 4 times, most recently from 2e86003 to 18cc2d2 on January 13, 2025 at 23:28
@Ruturaj4 (Collaborator, Author) commented Jan 13, 2025

@dfm I'm still not sure why this error won't go away; I have the protections in place. Could it be due to how you test this in your internal CI?

Seems like you are getting the jaxlib from upstream and that is why the related tests fail?

[two attached screenshots of the failing tests]

@Ruturaj4 (Collaborator, Author) commented:

> This looks good overall - thanks! My main high-level comment is that it would be useful to move as much of the #ifdef JAX_GPU_HIP logic as possible into vendor.h rather than keeping it in rnn_kernels.cc directly. It's OK to have some, but the more we can move, the better. Can you look into redefining some of the macros in vendor.h to consolidate the logic there?

@dfm thanks. I see what you mean. However, the MIOpen APIs are quite different from the cuDNN ones. For example:

#ifdef JAX_GPU_HIP
  JAX_RETURN_IF_ERROR(JAX_AS_STATUS(gpudnnSetDropoutDescriptor(
      dropout_desc, handle.get(), d.dropout, dropout_states_dev, state_size, 123, false, false,
      MIOPEN_RNG_PSEUDO_XORWOW)));
#else // JAX_GPU_CUDA
  JAX_RETURN_IF_ERROR(JAX_AS_STATUS(gpudnnSetDropoutDescriptor(
      dropout_desc, handle.get(), d.dropout, nullptr, state_size, 123)));
#endif // JAX_GPU_HIP

I checked to see how many of the JAX_GPU_HIP blocks I could move, but it seems very difficult to consolidate much given the significant differences between the APIs. What do you think?

@dfm (Collaborator) commented Jan 14, 2025

> Seems like you are getting the jaxlib from upstream and that is why the related tests fail?

Yes! We require that jax (the Python package) always be compatible with the currently released jaxlib. You'll probably need some sort of version guard, or you can protect the jax/jaxlib boundary using something like hasattr(gpu_rnn, "...").
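
For the version-guard alternative mentioned here, one hedged sketch (an editor's illustration; the cutoff version is a placeholder, and mlir, gpu_rnn, rnn_fwd_p, and rnn_bwd_p are as in the sketch above):

# Hypothetical version guard; the minimum version below is illustrative only.
import importlib.metadata

def _jaxlib_at_least(*minimum: int) -> bool:
  # Compare the installed jaxlib version against a (major, minor, patch) minimum.
  installed = importlib.metadata.version("jaxlib").split(".")[:3]
  return tuple(int(p) for p in installed) >= minimum

# Only register the ROCm lowerings when the installed jaxlib is new enough to
# ship miopen_rnn_fwd_lowering / miopen_rnn_bwd_lowering.
if _jaxlib_at_least(0, 5, 0):  # placeholder cutoff, not the real one for this change
  mlir.register_lowering(rnn_fwd_p, gpu_rnn.miopen_rnn_fwd_lowering, platform='rocm')
  mlir.register_lowering(rnn_bwd_p, gpu_rnn.miopen_rnn_bwd_lowering, platform='rocm')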

Also: It looks like this has introduced some build issues for the CUDA CI. Can you take a look at those too?

Ruturaj4 force-pushed the ci_rnn_final-upstream branch 3 times, most recently from a909942 to dfd1a65 on January 15, 2025 at 00:30
Ruturaj4 force-pushed the ci_rnn_final-upstream branch from dfd1a65 to fe68eb8 on January 15, 2025 at 01:04
@Ruturaj4 (Collaborator, Author) commented:

@dfm I just fixed the patch. Could you please approve? Thanks!

dfm self-assigned this on Jan 15, 2025
copybara-service bot merged commit 41993fd into jax-ml:main on Jan 15, 2025 (23 of 24 checks passed)
Labels: pull ready (Ready for copybara import and testing)
Participants: 4