pytest benchmark reporting incorrect benchmark time #3753
Noting a few things I tried so far. For a minimal repro, I tried running the benchmark with only a single variant enabled.
cc'ing @Priya2698
@jjsjann123 do you see this issue in the forward pass as well? If it is specific to the backward pass, did it occur before PR #3394, which changed how gradients are set?
Additionally, did you see this discrepancy for other executors (torch.compile, eager)?
Interesting. Added this, and it looks like the issue only occurs when running through --benchmark-thunder. My local runs with torch.compile and eager don't seem to trigger any issue.
So no issues with
Yeah. Hmmm, this makes me question my own sanity... let me double check the nsys profile.
🤕 nsys profile doesn't work with torch.profiler. Adding some printfs shows we did indeed call 6 kernels on backward inside nvFuser, while torch.profiler only picks up 5 events.
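For readers who want to reproduce this kind of check outside the benchmark harness, here is a minimal sketch that counts the CUDA kernel events torch.profiler records around a backward call. The nn.Linear model and tensor shapes are placeholders (the actual repro is the rope backward inside nvFuser), and the DeviceType import path may differ across torch versions:

```python
import torch
from torch.autograd import DeviceType  # assumed re-export; may live under torch._C._autograd on some versions
from torch.profiler import ProfilerActivity, profile

# Placeholder workload standing in for the rope backward; any function that
# launches a known number of CUDA kernels works for this check.
model = torch.nn.Linear(1024, 1024).cuda()
inp = torch.randn(8, 1024, device="cuda", requires_grad=True)
out = model(inp).sum()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out.backward()
torch.cuda.synchronize()

# If the profiler silently drops an event, this count comes up short of the
# number of kernels actually launched (6 launched vs. 5 recorded in the
# nvFuser case described above).
cuda_events = [evt for evt in prof.events() if evt.device_type == DeviceType.CUDA]
print(f"profiler recorded {len(cuda_events)} CUDA kernel events")
for evt in cuda_events:
    print(" ", evt.name)
```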
@wujingyue mentioned that nvFuser's CUPTI integration has had issues with some UCC measurement code. So I got a bit paranoid and removed the CUPTI dependency in my local build (as well as removing the CUPTI-related code). That still doesn't seem to change anything though.
This is a bizarre issue.
The observed behavior is that our pytest benchmark using torch.profiler.profile seems to be non-deterministically dropping events in consecutive benchmark runs.

In PR branch #3743, running the backward benchmarks as a whole generates numbers like this (on H100) when running:

NVFUSER_DISABLE=kernel_reuse pytest --benchmark-thunder test_rope.py -k bwd
In that example, if we comment out the other variants inside benchmarks/python/test_rope.py to run only hf_phi3, for example, we get a different set of numbers.

Further debugging went down here:
Fuser/benchmarks/python/core.py, lines 156 to 157 in 8ea30c7
We noticed that the benchmark discrepancy comes from dropped CUDA events in consecutive runs, i.e. when we have 6 kernels running in the backward pass, only 5 of them are recorded by the profiler.
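For context on what those two lines in core.py feed into: the reported time is, roughly, a sum of per-event device times over the CUDA events returned by the profiler, so each dropped event silently shortens the measured time by one kernel's duration. A paraphrased sketch of that style of aggregation (not the exact code at those lines; attribute names vary across torch versions) looks like:

```python
from torch.autograd import DeviceType  # import path may vary across torch versions


def kernel_time_seconds(prof) -> float:
    """Sum self device time (microseconds) over CUDA events and convert to seconds.

    `prof` is a torch.profiler.profile object after the profiled region has run.
    If the profiler drops one of the N kernels launched by the benchmarked
    function, this sum (and thus the reported benchmark time) is short by
    exactly that kernel's duration, matching the 6-vs-5 discrepancy above.
    """
    return (
        sum(
            evt.self_device_time_total  # named self_cuda_time_total in older torch releases
            for evt in prof.key_averages()
            if evt.device_type == DeviceType.CUDA
        )
        / 1e6
    )
```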