Add Intel GPU Energy APIs #563
base: dev
Hi @masterleinad, thank you so much for this contribution! We don't have this system at our end to build and test on, so this is extremely helpful. I am going to need some time to review and merge this, because we have no resources or funding at our end to put significant time into it. I am hoping to get these GPU energy (print/JSON) APIs merged and documented in the next 2-3 weeks.
Just sending a note that this is on my list, and your contribution on the Intel GPU side is extremely helpful! Thank you! :)
With the last commit, the output of ./variorum-print-verbose-energy-example is now:
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 0, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 1, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 2, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 3, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 4, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 5, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 0, Energy: 586.303284 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 1, Energy: 587.496582 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 2, Energy: 568.604737 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 3, Energy: 560.272522 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 4, Energy: 594.283875 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 5, Energy: 572.796081 J
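Currently, I'm storing the initial energy in a global static variable, so the first call reports 0 J for every device and later calls report the energy consumed since that first call. I noticed that …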
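For illustration, here is a minimal C sketch of the global-static baseline approach, assuming the raw cumulative energy counters (in joules) for all devices have already been read into an array. The names `energy_since_first_call` and `reset_energy_baseline` are hypothetical and are not part of Variorum or APMIDG.

```c
#include <stdlib.h>

/* Baseline counter values captured on the first call, one per device.
 * File-scope statics mirror the "global static variable" approach. */
static double *g_baseline_j = NULL;
static int g_num_devices = 0;

/* Hypothetical helper: converts raw cumulative energy counters (joules) into
 * energy consumed since the first call. The baseline array is allocated at
 * run time because the GPU count is unknown at compile time.
 * Returns 0 on success, -1 on invalid input, allocation failure, or a
 * device-count mismatch. */
int energy_since_first_call(const double *raw_j, int num_devices,
                            double *delta_j)
{
    if (raw_j == NULL || delta_j == NULL || num_devices <= 0) {
        return -1;
    }
    if (g_baseline_j == NULL) {
        g_baseline_j = malloc((size_t)num_devices * sizeof *g_baseline_j);
        if (g_baseline_j == NULL) {
            return -1;
        }
        for (int i = 0; i < num_devices; i++) {
            g_baseline_j[i] = raw_j[i];
        }
        g_num_devices = num_devices;
    }
    if (num_devices != g_num_devices) {
        return -1; /* device count changed since the baseline was taken */
    }
    for (int i = 0; i < num_devices; i++) {
        delta_j[i] = raw_j[i] - g_baseline_j[i];
    }
    return 0;
}

/* Frees the baseline; the next call starts a new measurement window. */
void reset_energy_baseline(void)
{
    free(g_baseline_j);
    g_baseline_j = NULL;
    g_num_devices = 0;
}
```

On the first call every delta is 0.0, which matches the first six lines of the output; later calls return the energy used since that baseline. The heap allocation has to be released explicitly, here via `reset_energy_baseline`.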
Hi @masterleinad, thanks for getting to this before I could! I've been meaning to circle back to Variorum but have been very short on time. Take a look at how we did this for the IBM port [here]: energy is not directly reported by the sensors on Power9, so we need to explicitly sample and track it. You can see where the variables are set in variorum/src/variorum/IBM/Power9.c, lines 20 to 26 (commit 42e0803).
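When a sensor exposes instantaneous power but not energy, energy has to be accumulated from timestamped samples. The sketch below does this with the trapezoidal rule under that assumption; the struct and function names are hypothetical, and this is not the code in Power9.c.

```c
/* Hypothetical accumulator for sensors that report power but not energy. */
struct energy_accumulator {
    double energy_j;     /* energy accumulated so far, joules */
    double last_power_w; /* previous power sample, watts */
    double last_time_s;  /* timestamp of the previous sample, seconds */
    int has_sample;      /* 0 until the first sample is recorded */
};

void accumulator_init(struct energy_accumulator *acc)
{
    acc->energy_j = 0.0;
    acc->last_power_w = 0.0;
    acc->last_time_s = 0.0;
    acc->has_sample = 0;
}

/* Records one power sample. Energy grows by the trapezoid area between the
 * previous and current samples; duplicate or out-of-order timestamps are
 * ignored so time never runs backwards. */
void accumulator_add_sample(struct energy_accumulator *acc, double time_s,
                            double power_w)
{
    if (acc->has_sample) {
        if (time_s <= acc->last_time_s) {
            return;
        }
        acc->energy_j += 0.5 * (acc->last_power_w + power_w)
                         * (time_s - acc->last_time_s);
    }
    acc->last_power_w = power_w;
    acc->last_time_s = time_s;
    acc->has_sample = 1;
}
```

For example, samples of 100 W at t = 0 s and 200 W at t = 1 s add 150 J. The estimate is only as good as the sampling rate, which is why the samples must be taken explicitly and regularly.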
The problem is that I don't know at compile time how many GPUs there are, but I need to store data for each of them to report the energy consumed since the first call. The only way to avoid allocating memory dynamically would be to set an upper limit on the number of GPUs. Would you prefer that over the current approach (possibly with registering an …)?
Merge after #559
Description
Corresponds to #559 for Intel GPUs using APMIDG. Outputs look like
The node has 6 GPUs with two tiles each, for 12 GPU tiles in total. Seeing 2 sockets with 3 devices each is therefore surprising, although the existing Intel GPU APIs should show similar output. This needs some investigation, either in this pull request or elsewhere.
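The counts can be checked as arithmetic: 6 GPUs with 2 tiles each give 12 tiles, yet the output lists only 6 device IDs. The hypothetical helpers below only show that the printed Socket/DeviceID pattern matches 6 whole GPUs split contiguously across 2 sockets (3 per socket); they make no claim about how APMIDG actually enumerates devices.

```c
/* Hypothetical helpers; they do not query hardware. */

/* Total tiles on a node, e.g. 6 GPUs with 2 tiles each = 12. */
int total_tiles(int num_gpus, int tiles_per_gpu)
{
    return num_gpus * tiles_per_gpu;
}

/* Contiguous block mapping of device IDs to sockets. Assumes num_devices is
 * a multiple of num_sockets; returns -1 on invalid input. */
int socket_of_device(int device_id, int num_devices, int num_sockets)
{
    if (num_sockets <= 0 || num_devices < num_sockets ||
        device_id < 0 || device_id >= num_devices) {
        return -1;
    }
    int per_socket = num_devices / num_sockets;
    return device_id / per_socket;
}
```

With 6 devices and 2 sockets this maps IDs 0 to 2 to socket 0 and IDs 3 to 5 to socket 1, as in the output. If tiles were enumerated instead, the same mapping over 12 IDs would put IDs 0 to 5 on socket 0 and 6 to 11 on socket 1.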
Also, #559 should probably be merged first.
Type of change
How Has This Been Tested?
Tested on ALCF's Sunspot testbed.
Checklist:
- I have run ./scripts/check-code-format.sh and confirm my code follows the style guidelines of variorum
- Compiled with warnings enabled (-DENABLE_WARNINGS=ON)