[BUGFIX][0.5.0-UT] Skip CSR matmat and matvec float tests on ROCm <6.4 (NaN issue with beta==0) #380

Open
wants to merge 2 commits into
base: rocm-jaxlib-v0.5.0
from

psanal35

Older versions of rocSPARSE (<6.4) did not zero out memory when beta==0, which could allow NaNs to propagate through
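
For illustration only, here is a dense NumPy sketch of why beta==0 is a special case. It assumes the usual `y = alpha * A @ x + beta * y` form; the function names `axpby_naive` and `axpby_safe` are hypothetical and this is not rocSPARSE's code. If the output buffer already holds a NaN, multiplying it by zero still yields NaN unless the buffer is ignored or zeroed first.

```python
import numpy as np

def axpby_naive(alpha, A, x, beta, y):
    # Evaluates the formula literally: 0 * NaN is NaN, so stale NaNs in y survive.
    return alpha * (A @ x) + beta * y

def axpby_safe(alpha, A, x, beta, y):
    # Treats beta == 0 as "do not read y at all".
    if beta == 0:
        return alpha * (A @ x)
    return alpha * (A @ x) + beta * y

A = np.eye(2, dtype=np.float32)
x = np.array([1.0, 2.0], dtype=np.float32)
y = np.array([np.nan, 0.0], dtype=np.float32)  # output buffer with a stale NaN

naive = axpby_naive(1.0, A, x, 0.0, y)
safe = axpby_safe(1.0, A, x, 0.0, y)
print(naive, safe)
```

The naive result keeps the NaN in its first element, while the safe variant returns a clean product.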

    self.skipTest("skipping int32 type tests")
rocm_ver = get_rocm_version()
if rocm_ver < (6, 4) and dtype in [np.float32, np.complex64]:
    self.skipTest("ROCm <6.4 bug: NaN propagation when beta==0 (fixed in ROCm 6.4)")

fixed in rocm 6.5?

fixed in 6.5

Collaborator

I thought it was actually fixed in 6.4.0?

Author

Does it matter if it's 6.4.0 or 6.4? The tests are skipped for versions <6.4.

Collaborator

Checking against (6, 4) is sufficient, I think.
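
A quick sketch of why the two-element check covers 6.4.0 as well. Python compares tuples element by element, and a longer tuple with an equal prefix sorts after the shorter one. `parse_version` is a hypothetical helper, and the "major.minor.patch-build" string format is an assumption about the version file, not something confirmed in this PR.

```python
def parse_version(text):
    # Hypothetical helper: "6.4.0-47" -> (6, 4, 0). The "-build" suffix is assumed.
    core = text.strip().split("-")[0]
    return tuple(int(part) for part in core.split("."))

v634 = parse_version("6.3.4")
v640 = parse_version("6.4.0-47")

# 6.3.4 is older than 6.4, so the tests would be skipped.
print(v634 < (6, 4))
# 6.4.0 is not older than 6.4, so the tests would run.
print(v640 < (6, 4))
```

The same holds if `get_rocm_version()` returns only `(major, minor)`, since `(6, 4) < (6, 4)` is also false.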

from pathlib import Path

def get_rocm_version():
    version_path = Path("/opt/rocm/.info/version")
Collaborator

You will probably want to check ROCM_PATH from os.environ first, and then secondarily this path as the fallback.
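
One way this lookup order could be sketched. `rocm_version_candidates` is a hypothetical name, and it assumes the version file sits at `.info/version` under `ROCM_PATH`, mirroring the default install path; neither is confirmed by the PR.

```python
import os
from pathlib import Path

def rocm_version_candidates(environ=None):
    # Returns version-file paths to try, most specific first.
    environ = os.environ if environ is None else environ
    candidates = []
    rocm_path = environ.get("ROCM_PATH")
    if rocm_path:
        # Assumption: same .info/version layout as the default install.
        candidates.append(Path(rocm_path) / ".info" / "version")
    # Fallback: the default location used in this PR.
    candidates.append(Path("/opt/rocm/.info/version"))
    return candidates

for path in rocm_version_candidates({"ROCM_PATH": "/custom/rocm"}):
    print(path)
```

Passing the environment as a parameter keeps the helper easy to test without touching `os.environ`.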


def get_rocm_version():
    version_path = Path("/opt/rocm/.info/version")
    assert version_path.exists(), ("Expected ROCm version file")
Collaborator

Asserts are stripped when Python runs in optimized mode (-O), so it would be better to make this an "if not version_path.exists():" check and raise an exception.
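
A minimal sketch of that suggestion. `read_version_text` is a hypothetical helper, and `FileNotFoundError` is one reasonable choice of exception rather than what the PR ended up using.

```python
from pathlib import Path

def read_version_text(version_path):
    version_path = Path(version_path)
    # An explicit check still runs under "python -O", unlike an assert.
    if not version_path.exists():
        raise FileNotFoundError(f"Expected ROCm version file at {version_path}")
    return version_path.read_text().strip()

try:
    read_version_text("/nonexistent/rocm/.info/version")
except FileNotFoundError as err:
    print(f"lookup failed: {err}")
```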

Collaborator

mrodden commented Apr 23, 2025

Can we drop the [BUGFIX] at the beginning of the commit message? I think it's pretty obvious that changes are generally fixing bugs. The rest of the message makes sense to me as well.

psanal35 force-pushed the bugfix-csr-beta0-nan-0.5.0-ut branch from efd56ee to ee8b6c5 (April 23, 2025 16:34)
psanal35 force-pushed the bugfix-csr-beta0-nan-0.5.0-ut branch from 7632093 to acc7cf7 (April 23, 2025 19:29)
psanal35 requested review from Ruturaj4 and mrodden (April 23, 2025 20:28)

3 participants