pytest benchmark reporting incorrect benchmark time #3753

Open
jjsjann123 opened this issue Jan 23, 2025 · 9 comments
Labels
bug, Python Benchmarks

Comments

@jjsjann123
Collaborator

This is a bizarre issue.

The observed behavior is that our pytest benchmark, which uses torch.profiler.profile, seems to be non-deterministically dropping events in consecutive benchmark runs.

In PR branch #3743, running the backward benchmarks as a whole generates numbers like this on H100, using:
NVFUSER_DISABLE=kernel_reuse pytest --benchmark-thunder test_rope.py -k bwd

Name (time in us)                                                                    Mean                Median
-----------------------------------------------------------------------------------------------------------------------
test_rope_bwd_benchmark[executor='thunder'-variation='llama_2_7b_hf_rope']       871.5996 (5.23)       871.6050 (5.24)
test_rope_bwd_benchmark[executor='thunder'-variation='llama_3_8B_rope']        1,443.0095 (8.66)     1,442.9955 (8.67)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_mistral_nemo_rope']     166.5515 (1.0)        166.4480 (1.0)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_qwen2_rope']            386.4463 (2.32)       386.5565 (2.32)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope']             452.3351 (2.72)       452.0685 (2.72)
-----------------------------------------------------------------------------------------------------------------------

In that example, if we comment out the other variants inside benchmarks/python/test_rope.py and run only hf_phi3, for example, we get numbers like:

test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope']           514.7512  514.4900

Further debugging led here:

prof_averages = self.prof.key_averages()
elapsed_cuda_time = self._get_kernel_time(prof_averages)

It turns out the benchmark discrepancy comes from the profiler dropping CUDA events in consecutive runs, i.e. when we have 6 kernels running in the backward path, only 5 of them are recorded in the profiler.
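
For context, the benchmark derives the kernel time by summing the self device time of the CUDA events returned by key_averages(). A minimal sketch of that logic (a simplification, not the exact NVFBenchmark._get_kernel_time code) shows why a dropped event silently shrinks the reported time instead of raising an error:

import torch
from torch.autograd import DeviceType


def kernel_time_seconds(prof: torch.profiler.profile) -> float:
    """Sum the self device time of all CUDA events, in seconds."""
    elapsed_us = 0.0
    for event in prof.key_averages():
        if event.device_type != DeviceType.CUDA:
            continue
        # torch renamed self_cuda_time_total to self_device_time_total
        # (https://github.com/pytorch/pytorch/pull/123247).
        elapsed_us += (
            event.self_device_time_total
            if hasattr(event, "self_device_time_total")
            else event.self_cuda_time_total
        )
    # If the profiler drops a kernel, this sum is simply smaller, which is why
    # the benchmark reports a lower ("faster") time rather than an error.
    return elapsed_us / 1e6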

@jjsjann123
Collaborator Author

jjsjann123 commented Jan 23, 2025

Noting a few things I tried so far.

For a minimal repro, I tried running the benchmark with only hf_qwen2 and hf_phi3:

  1. Dropped events are non-deterministic, in the sense that sometimes they happen and sometimes they don't. But when it happens, it is always on hf_phi3, starting on round 2 and persisting for all following rounds. It very consistently drops only the first kernel from the events.

  2. I played with our use of torch.profiler: instead of calling start()/stop(), I explicitly called __enter__()/__exit__(None, None, None) to match the tutorials, and I also tried creating a new torch.profiler for every round of profiling (see the sketch after this list). None of that seems to make a difference.

  3. I also tried adding torch.cuda.synchronize() before and after every single start()/stop() call, which doesn't help either.
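
A rough sketch of those variations, for the record (this is a hypothetical helper, not the actual benchmark code, and assumes a CUDA-only profiler configuration):

import torch


def profile_one_round(fn, *args):
    # Fresh profiler object for this round instead of reusing one across rounds.
    prof = torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA]
    )
    torch.cuda.synchronize()  # tried: sync before starting the profiler
    prof.__enter__()  # tried: __enter__/__exit__ instead of start()/stop()
    fn(*args)
    torch.cuda.synchronize()  # tried: sync before stopping the profiler
    prof.__exit__(None, None, None)
    return prof.key_averages()

None of these variations changed the observed behavior.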

@jjsjann123
Collaborator Author

cc'ing @Priya2698

@Priya2698 added the bug and Python Benchmarks labels Jan 23, 2025
@Priya2698
Collaborator

@jjsjann123 do you see this issue in the forward pass as well? If it is specific to the backward pass, did it occur before PR #3394, which started setting gradients to None before each round? While that fix is valid, a bug could have been introduced if that setup call is not being invoked properly.
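
For reference, the per-round setup in question amounts to something like the following (a sketch based only on the description of PR #3394; the helper name is hypothetical):

def clear_gradients(*params):
    # Reset gradients before each benchmark round so backward does not
    # accumulate into stale .grad buffers.
    for p in params:
        p.grad = None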

@Priya2698
Collaborator

Additionally, did you see this discrepancy for other executors (torch.compile, eager)?

@jjsjann123
Collaborator Author

Interesting. I added the check below, and it looks like the issue only occurs when running through --benchmark-thunder.

My local runs with torchcompile and eager don't seem to trigger the issue.

root@e905c4727c0c:/opt/pytorch/nvfuser/benchmarks/python# git diff
diff --git a/benchmarks/python/core.py b/benchmarks/python/core.py
index c56d931c..6f9412e1 100644
--- a/benchmarks/python/core.py
+++ b/benchmarks/python/core.py
@@ -129,6 +129,7 @@ class NVFBenchmark:
         # Specifies if the timer in host measurement is called at the start/finish of execution.
         # Timings are measured at the end of execution.
         self.execution_start = True
+        self.num_kernels = None

     def __call__(self, function_to_benchmark: Callable, *args, **kwargs):
         return self.benchmark(function_to_benchmark, *args, **kwargs)
@@ -181,16 +182,22 @@ class NVFBenchmark:
         """
         elapsed_cuda_time = 0
         has_cuda_event = False
+        num_kernels = 0
         for event in prof_averages:
             if event.device_type != DeviceType.CUDA:
                 continue
             has_cuda_event = True
+            num_kernels += 1
             # Re: torch profiler API changes in https://github.com/pytorch/pytorch/pull/123247
             elapsed_cuda_time = (
                 elapsed_cuda_time + event.self_device_time_total
                 if hasattr(event, "self_device_time_total")
                 else event.self_cuda_time_total
             )
+        if self.num_kernels is None:
+            self.num_kernels = num_kernels
+        else:
+            assert self.num_kernels == num_kernels
         assert has_cuda_event, "No CUDA events found"
         return elapsed_cuda_time / 1e6
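
For reference, the same consistency check can be reproduced outside the benchmark harness with a small standalone loop (a sketch; any CUDA workload that launches a fixed number of kernels would do):

import torch
from torch.autograd import DeviceType


def count_cuda_events(prof):
    return sum(1 for e in prof.key_averages() if e.device_type == DeviceType.CUDA)


def check_event_counts(fn, rounds=5):
    expected = None
    for i in range(rounds):
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA]
        ) as prof:
            fn()
            torch.cuda.synchronize()  # make sure kernels finish inside the profiled window
        n = count_cuda_events(prof)
        if expected is None:
            expected = n
        # Mirrors the assertion added in the diff above: the number of CUDA
        # events should stay identical across consecutive rounds.
        assert n == expected, f"round {i}: {n} CUDA events, expected {expected}"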

@Priya2698
Collaborator

So no issues with thunder-torchcompile either?

@jjsjann123
Collaborator Author

Yeah, even --benchmark-thunder-torchcompile is fine.

hmmm, this makes me question my own sanity... let me double check with nsys profile...

@jjsjann123
Collaborator Author

🤕 nsys profile doesn't work with torch.profiler.

Adding some printfs shows that we did indeed launch 6 kernels in the backward pass inside nvfuser, while torch.profiler only picks up 5 events.

[2025-01-23 15:59:27] test_rope.py::test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope'] launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g3_cu_6a224ba5_1329329nvfuser_pointwise_f2_c1_r0_g3ENS_6TensorINS_8__bfloatELi1ELi1EEENS0_IfLi3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g4_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g4ENS_6TensorIxLi2ELi2EEENS0_IfLi3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g8_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g8ENS_6TensorIfLi3ELi3EEENS0_INS_8__bfloatELi4ELi4EEES3_S3_S3_S3_S3_NS0_IfLi4ELi4EEES4_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g7_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_NS0_IfLi4ELi4EEES3_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g4_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g4ENS_6TensorINS_8__bfloatELi4ELi4EEENS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g6_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g6ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g7_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g8_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g8ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g9_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g9ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g5_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g5ENS_6TensorINS_8__bfloatELi3ELi3EEES2_S2_S2_
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g4((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=58.209us cuda_time=58.209us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g6((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=57.792us cuda_time=57.792us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g7((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=57.312us cuda_time=57.312us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g8((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=87.137us cuda_time=87.137us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g9((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=85.473us cuda_time=85.473us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g5((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=161.953us cuda_time=161.953us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
first time kernels num_kernels=6
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g4_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g4ENS_6TensorINS_8__bfloatELi4ELi4EEENS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g6_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g6ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g7_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g8_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g8ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g9_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g9ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g5_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g5ENS_6TensorINS_8__bfloatELi3ELi3EEES2_S2_S2_
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g6((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=60.159us cuda_time=60.159us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g7((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=58.558us cuda_time=58.558us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g8((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=88.477us cuda_time=88.477us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g9((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=86.014us cuda_time=86.014us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g5((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=162.556us cuda_time=162.556us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
follwo up kernels num_kernels=5

@jjsjann123
Collaborator Author

@wujingyue mentioned that nvfuser's cupti has issues with some UCC measurement tooling. So I got a bit paranoid and removed the cupti dependency in my local build (as well as the cupti-related code in fusion_profiler.cpp).

That still doesn't seem to change anything, though.
