pytest benchmark reporting incorrect benchmark time #3753

Open
jjsjann123 opened this issue Jan 23, 2025 · 9 comments
Labels
bug, Python Benchmarks

Comments

@jjsjann123
Collaborator

This is a bizarre issue.

The observed behavior is that our pytest benchmark, which uses torch.profiler.profile, seems to be non-deterministically dropping events in consecutive benchmark runs.

In PR branch #3743, running the backward benchmarks as a whole generates numbers like this on H100, using:
NVFUSER_DISABLE=kernel_reuse pytest --benchmark-thunder test_rope.py -k bwd

Name (time in us)                                                                    Mean                Median
-----------------------------------------------------------------------------------------------------------------------
test_rope_bwd_benchmark[executor='thunder'-variation='llama_2_7b_hf_rope']       871.5996 (5.23)       871.6050 (5.24)
test_rope_bwd_benchmark[executor='thunder'-variation='llama_3_8B_rope']        1,443.0095 (8.66)     1,442.9955 (8.67)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_mistral_nemo_rope']     166.5515 (1.0)        166.4480 (1.0)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_qwen2_rope']            386.4463 (2.32)       386.5565 (2.32)
test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope']             452.3351 (2.72)       452.0685 (2.72)
-----------------------------------------------------------------------------------------------------------------------

In that example, if we comment out the other variants inside benchmarks/python/test_rope.py and run only hf_phi3, for example, we get numbers like:

test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope']           514.7512  514.4900

Further debugging led here:

prof_averages = self.prof.key_averages()
elapsed_cuda_time = self._get_kernel_time(prof_averages)

It turns out the benchmark discrepancy comes from the profiler dropping CUDA events in consecutive runs, i.e. when we have 6 kernels running in the backward path, only 5 of them are recorded in the profiler.
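
For context, the benchmark derives the kernel time by summing the self device time of the CUDA events returned by key_averages(). A minimal sketch of that logic (a simplification, not the exact NVFBenchmark._get_kernel_time code) shows why a dropped event silently shrinks the reported time instead of raising an error:

import torch
from torch.autograd import DeviceType


def kernel_time_seconds(prof: torch.profiler.profile) -> float:
    """Sum the self device time of all CUDA events, in seconds."""
    elapsed_us = 0.0
    for event in prof.key_averages():
        if event.device_type != DeviceType.CUDA:
            continue
        # torch renamed self_cuda_time_total to self_device_time_total
        # (https://github.com/pytorch/pytorch/pull/123247).
        elapsed_us += (
            event.self_device_time_total
            if hasattr(event, "self_device_time_total")
            else event.self_cuda_time_total
        )
    # If the profiler drops a kernel, this sum is simply smaller, which is why
    # the benchmark reports a lower ("faster") time rather than an error.
    return elapsed_us / 1e6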

@jjsjann123
Collaborator Author

jjsjann123 commented Jan 23, 2025

Noting a few things I tried so far.

For a minimal repro, I tried running the benchmark with only hf_qwen2 and hf_phi3:

  1. Dropped events are non-deterministic, in the sense that sometimes they happen and sometimes they don't. But when it happens, it is always on hf_phi3, starting on round 2 and persisting for all following rounds. It very consistently drops only the first kernel from the events.

  2. I played with our use of torch.profiler: instead of calling start()/stop(), I explicitly called __enter__()/__exit__(None, None, None) to match the tutorials, and I also tried creating a new torch.profiler for every round of profiling (see the sketch after this list). None of that seems to make a difference.

  3. I also tried adding torch.cuda.synchronize() before and after every single start()/stop() call, which doesn't help either.
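
A rough sketch of those variations, for the record (this is a hypothetical helper, not the actual benchmark code, and assumes a CUDA-only profiler configuration):

import torch


def profile_one_round(fn, *args):
    # Fresh profiler object for this round instead of reusing one across rounds.
    prof = torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA]
    )
    torch.cuda.synchronize()  # tried: sync before starting the profiler
    prof.__enter__()  # tried: __enter__/__exit__ instead of start()/stop()
    fn(*args)
    torch.cuda.synchronize()  # tried: sync before stopping the profiler
    prof.__exit__(None, None, None)
    return prof.key_averages()

None of these variations changed the observed behavior.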

@jjsjann123
Collaborator Author

cc'ing @Priya2698

@Priya2698 added the bug and Python Benchmarks labels Jan 23, 2025
@Priya2698
Collaborator

@jjsjann123 do you see this issue in the forward pass as well? If it is specific to the backward pass, did it occur before PR #3394, which started setting gradients to None before each round? While that fix is valid, a bug could have been introduced if that setup call is not being invoked properly.
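
For reference, the per-round setup in question amounts to something like the following (a sketch based only on the description of PR #3394; the helper name is hypothetical):

def clear_gradients(*params):
    # Reset gradients before each benchmark round so backward does not
    # accumulate into stale .grad buffers.
    for p in params:
        p.grad = None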

@Priya2698
Collaborator

Additionally, did you see this discrepancy for other executors (torch.compile, eager)?

@jjsjann123
Collaborator Author

Interesting. I added the check below, and it looks like the issue only occurs when running through --benchmark-thunder.

My local runs with torchcompile and eager don't seem to trigger the issue.

root@e905c4727c0c:/opt/pytorch/nvfuser/benchmarks/python# git diff
diff --git a/benchmarks/python/core.py b/benchmarks/python/core.py
index c56d931c..6f9412e1 100644
--- a/benchmarks/python/core.py
+++ b/benchmarks/python/core.py
@@ -129,6 +129,7 @@ class NVFBenchmark:
         # Specifies if the timer in host measurement is called at the start/finish of execution.
         # Timings are measured at the end of execution.
         self.execution_start = True
+        self.num_kernels = None

     def __call__(self, function_to_benchmark: Callable, *args, **kwargs):
         return self.benchmark(function_to_benchmark, *args, **kwargs)
@@ -181,16 +182,22 @@ class NVFBenchmark:
         """
         elapsed_cuda_time = 0
         has_cuda_event = False
+        num_kernels = 0
         for event in prof_averages:
             if event.device_type != DeviceType.CUDA:
                 continue
             has_cuda_event = True
+            num_kernels += 1
             # Re: torch profiler API changes in https://github.com/pytorch/pytorch/pull/123247
             elapsed_cuda_time = (
                 elapsed_cuda_time + event.self_device_time_total
                 if hasattr(event, "self_device_time_total")
                 else event.self_cuda_time_total
             )
+        if self.num_kernels is None:
+            self.num_kernels = num_kernels
+        else:
+            assert self.num_kernels == num_kernels
         assert has_cuda_event, "No CUDA events found"
         return elapsed_cuda_time / 1e6
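
For reference, the same consistency check can be reproduced outside the benchmark harness with a small standalone loop (a sketch; any CUDA workload that launches a fixed number of kernels would do):

import torch
from torch.autograd import DeviceType


def count_cuda_events(prof):
    return sum(1 for e in prof.key_averages() if e.device_type == DeviceType.CUDA)


def check_event_counts(fn, rounds=5):
    expected = None
    for i in range(rounds):
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA]
        ) as prof:
            fn()
            torch.cuda.synchronize()  # make sure kernels finish inside the profiled window
        n = count_cuda_events(prof)
        if expected is None:
            expected = n
        # Mirrors the assertion added in the diff above: the number of CUDA
        # events should stay identical across consecutive rounds.
        assert n == expected, f"round {i}: {n} CUDA events, expected {expected}"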

@Priya2698
Collaborator

So no issues with thunder-torchcompile either?

@jjsjann123
Collaborator Author

Yeah, even --benchmark-thunder-torchcompile is fine.

hmmm, this makes me question my own sanity... let me double check with nsys profile...

@jjsjann123
Collaborator Author

🤕 nsys profile doesn't work with torch.profiler.

Adding some printfs shows that we did indeed launch 6 kernels in the backward pass inside nvfuser, while torch.profiler only picks up 5 events.

[2025-01-23 15:59:27] test_rope.py::test_rope_bwd_benchmark[executor='thunder'-variation='hf_phi3_rope'] launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g3_cu_6a224ba5_1329329nvfuser_pointwise_f2_c1_r0_g3ENS_6TensorINS_8__bfloatELi1ELi1EEENS0_IfLi3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g4_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g4ENS_6TensorIxLi2ELi2EEENS0_IfLi3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g8_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g8ENS_6TensorIfLi3ELi3EEENS0_INS_8__bfloatELi4ELi4EEES3_S3_S3_S3_S3_NS0_IfLi4ELi4EEES4_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f2_c1_r0_g7_cu_aa9ba9dd_1329329nvfuser_pointwise_f2_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_NS0_IfLi4ELi4EEES3_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g4_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g4ENS_6TensorINS_8__bfloatELi4ELi4EEENS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g6_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g6ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g7_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g8_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g8ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g9_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g9ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g5_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g5ENS_6TensorINS_8__bfloatELi3ELi3EEES2_S2_S2_
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g4((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=58.209us cuda_time=58.209us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g6((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=57.792us cuda_time=57.792us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g7((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=57.312us cuda_time=57.312us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g8((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=87.137us cuda_time=87.137us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g9((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=85.473us cuda_time=85.473us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g5((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=161.953us cuda_time=161.953us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
first time kernels num_kernels=6
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g4_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g4ENS_6TensorINS_8__bfloatELi4ELi4EEENS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g6_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g6ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g7_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g7ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g8_cu_090d8191_1329329nvfuser_pointwise_f3_c1_r0_g8ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g9_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g9ENS_6TensorINS_8__bfloatELi4ELi4EEES2_S2_S2_NS0_IS1_Li3ELi3EEE
launching kernels, _ZN76_GLOBAL__N__00000000_37___tmp_kernel_pointwise_f3_c1_r0_g5_cu_36c6ff04_1329329nvfuser_pointwise_f3_c1_r0_g5ENS_6TensorINS_8__bfloatELi3ELi3EEES2_S2_S2_
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g6((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=60.159us cuda_time=60.159us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g7((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=58.558us cuda_time=58.558us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g8((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=88.477us cuda_time=88.477us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g9((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 4, 4>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=86.014us cuda_time=86.014us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
<FunctionEventAvg key=(anonymous namespace)::nvfuser_pointwise_f3_c1_r0_g5((anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>, (anonymous namespace)::Tensor<(anonymous namespace)::__bfloat, 3, 3>) self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=162.556us cuda_time=162.556us input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
follwo up kernels num_kernels=5

@jjsjann123
Collaborator Author

@wujingyue mentioned that nvfuser's cupti has issues with some UCC measurement tooling. So I got a bit paranoid and removed the cupti dependency in my local build (as well as the cupti-related code in fusion_profiler.cpp).

That still doesn't seem to change anything, though.
