Implausible cycle count when PMU muxing is active #53
The above measurements are with the diff from #50 (comment), but this also happens using

Here's a strange measurement from current master:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.43e+04 59.4% # 0.0 cycles per ns
╶ instructions 1.26e+07 100.0% # 877.8 insns per cycle
╶ branch-instructions 1.07e+06 100.0% # 8.5% of insns
╶ branch-misses 1.19e+04 100.0% # 1.1% of branch insns
╶ cache-misses 1.09e+05 100.0% # 8.2% of cache refs
╶ cache-references 1.33e+06 40.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
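As a quick sanity check on the numbers in that table (values copied from it; the limits quoted in the comments are just ballpark figures for modern cores), both the implied instructions-per-cycle and the cycle count itself look impossible:

cycles, insns = 1.43e4, 1.26e7
insns / cycles    # ≈ 881 insns per cycle; real cores retire only a handful per cycle
# A multi-GHz core also burns roughly 1e6 cycles per millisecond, so 1.43e4 cycles is orders
# of magnitude too low for a rand(1000,1000) call. The instruction count looks sane; the
# cycle count does not.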
I haven't been able to get
Seems to be affected by the number of events we're creating for threading (I'm running on a 64-core / 128-hyper-thread machine). This is the worst outlier with

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" threads=false rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 4.27e+06 73.0% # 1.2 cycles per ns
╶ instructions 1.03e+07 100.0% # 2.4 insns per cycle
╶ branch-instructions 6.23e+05 100.0% # 6.0% of insns
╶ branch-misses 4.94e+03 100.0% # 0.8% of branch insns
╶ cache-misses 8.85e+04 100.0% # 7.6% of cache refs
╶ cache-references 1.17e+06 27.4%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Using

Weirdly enough, the situation also improves dramatically if I use
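For scale, the number of perf file descriptors that have to be enabled and disabled grows with (events × monitored threads), which is presumably why restricting the measurement to one thread helps; a rough, purely illustrative estimate (the thread count here is an assumption, not something reported in the output above):

nevents  = 6     # events requested in the @pstats string above
nthreads = 32    # assumed, e.g. an openBLAS worker pool; varies by machine
nevents * nthreads   # ≈ 192 event fds to juggle, vs. just 6 with threads=false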
If I group the cpu-cycles and instructions events:

julia> @pstats "(cpu-cycles,instructions),branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 8.71e+04 40.7% # 0.0 cycles per ns
└ instructions 2.60e+04 40.7% # 0.3 insns per cycle
╶ branch-instructions 1.01e+06 100.0% # 3900.4% of insns
╶ branch-misses 1.33e+04 100.0% # 1.3% of branch insns
╶ cache-misses 1.04e+05 100.0% # 11.7% of cache refs
╶ cache-references 8.91e+05 59.6%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Somehow these counters were active ~40% of the time, but the reading was ~100x lower than expected.
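For context, here is a minimal sketch of how a multiplexed counter is normally extrapolated from its active-time fraction, assuming the printed percentages correspond to perf's time_running / time_enabled ratio:

# Standard perf scaling for a multiplexed counter: the raw count is extrapolated by
# the fraction of time the event was actually scheduled on the PMU.
scaled_count(raw, time_enabled, time_running) =
    time_running == 0 ? zero(raw) : raw * (time_enabled / time_running)

# Even if the printed 2.60e4 were a raw (unscaled) count, being active 40.7% of the time
# only scales it up by about 2.5x, nowhere near the ~1e7 instructions seen in other runs.
scaled_count(2.60e4, 100.0, 40.7)   # ≈ 6.4e4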
All the issues you've had have been on a single machine, right?
Yeah, that's right. I'm going to try to take some measurements tomorrow on other HW.
I suspected this might be processor C-states, so I disabled any power states by setting

In fact, on the machine I'm testing we can even get a zero cpu-cycles reading:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 0.00e+00 31.0% # 0.0 cycles per ns
╶ instructions 2.38e+06 100.0% # Inf insns per cycle
╶ branch-instructions 1.25e+05 100.0% # 5.3% of insns
╶ branch-misses 4.50e+01 100.0% # 0.0% of branch insns
╶ cache-misses 1.21e+05 100.0% # 96.1% of cache refs
╶ cache-references 1.25e+05 100.0%
╶ L1-dcache-loads 3.01e+02 100.0%
╶ L1-dcache-load-misses 3.55e+02 100.0% # 117.9% of dcache loads
╶ L1-dcache-stores 1.25e+05 100.0%
╶ L1-dcache-loads 4.28e+02 70.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

There are also still measurements that are ≥1 order of magnitude smaller than they should be:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.27e+05 48.1% # 0.2 cycles per ns
...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Compared to a normal measurement:

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.19e+06 62.1% # 0.9 cycles per ns
...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This is an Intel Xeon system, vs. the Zen3 hw that was tested above.
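For anyone trying to reproduce this, here is a hedged sketch for inspecting the power-management state from Julia. The paths are the standard Linux cpufreq/cpuidle sysfs locations; this only checks the current governor and idle (C-) states, and is not necessarily the setting that was changed in the comment above:

# Read the frequency governor and the idle-state configuration for CPU 0.
cpu = "/sys/devices/system/cpu/cpu0"
governor = strip(read(joinpath(cpu, "cpufreq/scaling_governor"), String))
println("governor: ", governor)
for s in filter(startswith("state"), readdir(joinpath(cpu, "cpuidle")))
    name     = strip(read(joinpath(cpu, "cpuidle", s, "name"), String))
    disabled = strip(read(joinpath(cpu, "cpuidle", s, "disable"), String))
    println(s, " (", name, "): disable=", disabled)
end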
OK, it looks like there may be two sources of noise here... Above I had C-/P-states enabled and used

On the other hand, if I enable normal C-/P-states and use

julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.73e+04 47.2% # 0.0 cycles per ns
...

So we get bad readings for both:
The effects of having C-/P-states enabled on

Both of these only seem to happen with
OK I think I have a root cause. As it turns out, our kernel overhead for enabling / disabling perf events is much bigger than I thought. Across the several hundred perf fds that we open (primarily for the ~32 openBLAS threads on my machine), it takes 1M instructions / 10M cycles / 7 milliseconds to enable and disable the perf events once. If

So the time we shaved off in #38 is all on the user side, but the in-kernel overhead strongly dominates. That would generally be OK, because perf is supposed to ignore time spent in the kernel (if you set exclude_kernel, as we do). However, it turns out this is not true for time_running / time_enabled (see this patch to the manpages). Those estimates always include time spent in the kernel.

So the basic problem here is that we are dominated by time spent in the kernel, which skews our running-time estimates and makes our scaled counts much less reliable when

So that's the explanation. I'm not sure what that means we should do to adapt / work around it yet... any thoughts @vchuravy or @Zentrik?
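To make the skew concrete, here is an illustration with invented magnitudes (only the ~7 ms of enable/disable overhead and the "several hundred fds" are taken from the comment above): with exclude_kernel set, nothing accumulates while the ioctls run, yet that time still flows into time_enabled / time_running, so the uniform-rate extrapolation behind multiplexed counts can land far too low (or, if the counter happens to be scheduled only during the user work, far too high):

# Illustrative numbers only; they mirror the magnitudes described above, not real measurements.
kernel_ns   = 7_000_000   # time inside the enable/disable ioctls (counts nothing: exclude_kernel)
user_ns     = 2_000_000   # the user work actually being measured
true_cycles = 6_000_000   # cycles that user work really consumed

# Suppose multiplexing schedules this counter mostly during the kernel-dominated part:
time_enabled = kernel_ns + user_ns
time_running = 4_000_000  # its PMU slice overlapped almost none of the user work...
raw          = 50_000     # ...so it only saw a sliver of the real activity

raw * time_enabled / time_running   # ≈ 1.1e5, roughly 50x below true_cycles, because the
                                    # "uniform rate" assumption behind the scaling is broken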
Do other tools like perf deal with this in any way?
It's not uncommon for me to get a reading like this:
about 10-30% of the time, compared to a more reasonable measurement:
~100 instructions per cycle seems... a little high