LinuxPerf, @profile, and other experiments #377

Open
willow-ahrens opened this issue Nov 1, 2024 · 6 comments

Comments

@willow-ahrens (Collaborator) commented Nov 1, 2024:

This issue is to document various PRs surrounding LinuxPerf and other extensible benchmarking in BenchmarkTools. I've seen many great approaches, with various differences in semantics and interfaces. It seems that #375 profiles each eval loop (toggling on and off with a boolean), #347 is a generalized version of the same (it's unclear whether it can generalize to more than one extension at a time, such as profiling and perfing simultaneously), and #325 only perfs a single execution.

I recognize that different experiments require different setups. A sampling profiler requires a warmup and a minimum runtime, but probably doesn't need fancy tuning. A wall-clock time benchmark requires a warmup and a fancy eval loop where the evaluations are tuned, and maybe a GC scrub. What does LinuxPerf actually need? Are there any other experiments we also want to run (other than LinuxPerf)? Do we need metaprogramming to inline the LinuxPerf calls, or are function calls wrapping the samplefunc sufficient here?
@vchuravy @DilumAluthge @topolarity @Zentrik
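
As a point of reference, here is a minimal sketch of the three experiment shapes mentioned above, using each package's existing public entry points rather than any of the proposed integrations; the `work()` workload is a stand-in.

```julia
using BenchmarkTools, Profile, LinuxPerf

work() = sum(rand(1000))   # hypothetical workload

# Wall-clock benchmark: warmup, tuned evals-per-sample, many samples.
b = @benchmarkable work()
tune!(b)
run(b)

# Sampling profiler: warmup plus enough total runtime to collect samples;
# no evals tuning needed.
work()                                   # force compilation / warmup
Profile.clear()
@profile for _ in 1:10_000; work(); end

# Hardware counters: a single (or a few) measured executions may already be enough.
@pstats "cpu-cycles,instructions" work()
```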

@willow-ahrens (Collaborator, Author) commented:

Some discussion from Slack:
@topolarity
It is beneficial to be able to inline the LinuxPerf calls, since they can be implemented in just 1-2 instructions (although they are a syscall)
@topolarity
The current implementation should already be inlined by the compiler (at least in my proposed PR - I'm not 100% sure whether the same is true for the generalized version)
@willow-ahrens
I see, so the idea is to inline the LinuxPerf call to toggle instruction counting on and off, without introducing any additional overhead from function calls and stack popping, etc.
@topolarity
Yeah, exactly
@willow-ahrens
Does Julia have any performance-counting infrastructure beyond LinuxPerf that we would want to be aware of?
@topolarity
LinuxPerf needs a setup and teardown also, so that you guarantee you don't leak any PerfGroup objects (the PMU has limited resources, so we only want to actually ask it to schedule the specific measurements we need, or else it will start dropping samples)
@topolarity
Which is also why PMU-derived measurements generally need some kind of cooperation from the kernel (perf in the Linux case, or a custom driver in the Windows case - VTune is probably the most common example)
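
For concreteness, the setup/measure/teardown lifecycle described in that exchange might look roughly like the sketch below. All function names here (`open_counters`, `enable!`, `disable!`, `read_counters`, `close_counters`) are hypothetical placeholders for whatever API the integration settles on; the point is the resource lifecycle, not the exact calls.

```julia
# Hedged sketch of the counter lifecycle; nothing here is LinuxPerf.jl API.
function sample_with_counters(samplefunc, evals)
    bench = open_counters("cpu-cycles,instructions")  # setup: ask the PMU to schedule only the events we need
    try
        enable!(bench)        # toggle on: ideally inlined, ~1-2 instructions plus a syscall
        samplefunc(evals)     # the eval loop under measurement
        disable!(bench)       # toggle off before any bookkeeping runs
        return read_counters(bench)
    finally
        close_counters(bench) # teardown: never leak perf groups; PMU slots are scarce
    end
end
```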

@DilumAluthge (Member) commented:

@gbaraldi I remember you being interested in this in the past.

@willow-ahrens (Collaborator, Author) commented:

I think a few questions I have remaining are:

  1. When should setup and teardown be called for LinuxPerf? How many samples are needed, and does this match the requirements for wall-clock measurements, or should we be designing a different sampling function for each kind of experiment? I think it's plausible that we wouldn't really care about evals and tuning for performance counters.
  2. Does Julia have any performance-counting infrastructure beyond LinuxPerf that we would want to be aware of?

@willow-ahrens (Collaborator, Author) commented Nov 1, 2024:

@vchuravy says:
It's late on a Friday here, so I won't follow the discussion until Monday. One of the questions is how platform-specific we want to be, and how willing we are to make big changes.
One question is cycles vs. time, and CPU time vs. wall time.
#92
Right now BT measures wall time, which is unreliable but interpretable. Something like cycles (either through LinuxPerf or simply the TSC) is more reliable, but harder to interpret (darn chips clocking down under heat, IPC, ...).
Other tools measure FLOP/s or bytes/s (e.g. LIKWID, LinuxPerf). So maybe BenchmarkTools ought to provide a "specification" (e.g. @benchmarkable), and then different tools could provide executors that measure different things.
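
To make the wall-time vs. cycles distinction concrete, here is a small sketch; the `work()` workload is hypothetical and the `rdtsc` intrinsic call is x86-specific.

```julia
# Hedged sketch: wall time vs. raw cycle counts for the same workload.
work() = sum(rand(1000))   # hypothetical workload

# Wall time: portable and easy to interpret, but sensitive to frequency
# scaling, scheduling, and other machine noise.
t0 = time_ns(); work(); wall_ns = time_ns() - t0

# Cycles via the x86 time-stamp counter: more stable, but harder to interpret
# (frequency scaling, IPC, ...) and not portable.
rdtsc() = ccall("llvm.x86.rdtsc", llvmcall, UInt64, ())
c0 = rdtsc(); work(); cycles = rdtsc() - c0
```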

@willow-ahrens
Copy link
Collaborator Author

willow-ahrens commented Nov 1, 2024

It's starting to seem to me that BenchmarkTools really ought to define separate "samplers" which can measure different metrics using different tools and experiment loops, and provide infrastructure to run different samplers across suites of benchmarks.
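
One possible shape for that interface, purely as a hypothetical sketch (none of these types or methods exist in BenchmarkTools today):

```julia
# Hypothetical sketch of a "sampler"/"executor" interface; not existing BenchmarkTools API.
abstract type AbstractSampler end

struct WallTimeSampler <: AbstractSampler end   # warmup + tuned eval loop; measures time and allocations
struct PerfSampler     <: AbstractSampler       # setup/teardown around a few evals; measures hardware counters
    events::String
end
struct ProfileSampler  <: AbstractSampler end   # warmup + minimum runtime; collects stack samples

# Each sampler would decide its own warmup, tuning, and eval-loop semantics for a
# shared benchmark specification (e.g. something built by @benchmarkable):
# run_experiment(spec, sampler::AbstractSampler; kwargs...) -> results
```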

@willow-ahrens (Collaborator, Author) commented:

@vchuravy I think we should probably move forward with a short-term, straightforward LinuxPerf PR like #375 (assuming we can get a few reviews on it). We would mark the feature as experimental so we can make breaking changes to it. Later, we can work towards a BenchmarkTools interface that allows for more ergonomic custom benchmarking extensions (with @benchmark defining the function to be measured, and a separate "executor" or "sampler" interface which runs an experiment on the function). The redesign would be a good opportunity to fix #339, and perhaps allow for choices such as measuring or not measuring GC time, etc.
