-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sampling Refactoring #303
Sampling Refactoring #303
Conversation
@@ -420,9 +449,6 @@ def env_tracers | |||
end | |||
|
|||
# Sample | |||
# Currently the same `so` does the tracing, and the sampling | |||
# This mean that is the local rank is not part of the `traced-ranks` | |||
# No sampling will be performed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old documentation needed to be removed
sampling_daemon = SamplingDaemon.new | ||
sampling_daemon&.start(Process.pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to have a start and a initialize, they are now merged
// Run the signal loop | ||
signal(SIG_SAMPLING_FINISH, signal_handler); | ||
signal(RT_SIGNAL_FINISH, signal_handler_finish); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about signal (the doc said Avoid its use: use [sigaction(2)](https://man7.org/linux/man-pages/man2/sigaction.2.html) instead. See Portability below.
).
IMO we should just copy what we did for MPI, but 🤷🏽 didn't changed it yet at it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, MPI and this should use the same code, maybe we can even de-duplicate using inlined functions in a header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name is bad, should be ze_sampling_daemon
.
I think right now --sampling
without a ze backend is broken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make sense as we dont have support for the others yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we need to move it out of the folder and make it generic at some point (we could want to sample other platform counters). Each sampling backend should be activated by it's own environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kerilk I agree that associating the option with one backend is not ideal for the long term. Making it generic and adding the support for the others is good idea.
@@ -768,23 +770,24 @@ int main(int argc, char **argv) { | |||
_DL_ERROR_MSG(); | |||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will deadlock as we the host will wait for ready
. I didn't try, maybe we are luck as the non-zero exit code will trigger something and everybody will be happy and bail.
@solo2abera are you agree that we can merge this PR? If yes please approve, if not please comment :) I can work on @Kerilk comment on refactoring all the daemon together, and how to handle exit code on next PR Then in can follow up with the refactoring to handle multiple sampling "engine" Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the branch and it is good to merge.
Few minor change.
One change is that all the daemons are using the same signals (make code re-use easier). Maybe we want each daemon to have an independent set of signal numbers? Right now, it's not a problem, as we call it the Daemon serially.