Sampling Refactoring #303

Merged
merged 2 commits into from
Nov 27, 2024

Conversation

TApplencourt
Collaborator

A few minor changes.

One change is that all the daemons now use the same signals (this makes code reuse easier). Maybe we want each daemon to have an independent set of signal numbers? Right now it's not a problem, as we call the daemons serially.
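
If independent signal numbers per daemon are ever wanted, one possible approach is to carve disjoint slots out of the real-time signal range. The following is only a sketch under that assumption: `daemon_signals_t`, `daemon_signals`, and `SIGNALS_PER_DAEMON` are hypothetical names that do not come from this PR, and the two-signals-per-daemon layout (ready, finish) is a guess at what a daemon needs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Hypothetical layout: each daemon owns two consecutive real-time signals. */
enum { SIGNALS_PER_DAEMON = 2 };

typedef struct {
  int ready;  /* daemon -> host: "I am initialized" */
  int finish; /* host -> daemon: "please stop" */
} daemon_signals_t;

/* Fill *out with the signal pair reserved for daemon number `daemon_index`.
 * Returns 0 on success, -1 if the index is negative or the real-time
 * signal range is too small to hold that many daemons. */
static inline int daemon_signals(int daemon_index, daemon_signals_t *out) {
  /* SIGRTMIN and SIGRTMAX are runtime values, so compute the capacity here. */
  int available = (SIGRTMAX - SIGRTMIN + 1) / SIGNALS_PER_DAEMON;
  if (daemon_index < 0 || daemon_index >= available)
    return -1;
  int base = SIGRTMIN + daemon_index * SIGNALS_PER_DAEMON;
  out->ready = base;
  out->finish = base + 1;
  return 0;
}
```

With this layout, daemon 0 and daemon 1 never share a signal number, so they could run concurrently without stepping on each other.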

@@ -420,9 +449,6 @@ def env_tracers
end

# Sample
# Currently the same `so` does the tracing, and the sampling
# This mean that is the local rank is not part of the `traced-ranks`
# No sampling will be performed
Collaborator Author

Old documentation that needed to be removed.

Comment on lines -817 to -818
sampling_daemon = SamplingDaemon.new
sampling_daemon&.start(Process.pid)
Collaborator Author

@TApplencourt TApplencourt Nov 26, 2024

No need to have both a start and an initialize; they are now merged.

// Run the signal loop
signal(SIG_SAMPLING_FINISH, signal_handler);
signal(RT_SIGNAL_FINISH, signal_handler_finish);
Collaborator Author

Not sure about signal() (the man page says: "Avoid its use: use [sigaction(2)](https://man7.org/linux/man-pages/man2/sigaction.2.html) instead. See Portability below.").
IMO we should just copy what we did for MPI, but 🤷🏽 I haven't changed it yet, as it works.

Collaborator

Agreed, MPI and this should use the same code; maybe we can even de-duplicate using inline functions in a header.
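
To make this suggestion concrete, here is a minimal sketch of installing a handler with sigaction(2) instead of signal(2), written as a static inline helper that could sit in a header shared by several daemons. The names `install_handler`, `on_finish`, and `finish_requested` are hypothetical and not taken from the PR or from the MPI code, and `SIGUSR1` in the usage note merely stands in for the project's own signal constants.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag set by the handler. sig_atomic_t is the only integer
 * type guaranteed safe to write from an asynchronous signal handler. */
static volatile sig_atomic_t finish_requested = 0;

/* Hypothetical handler: only records that the signal arrived. */
static void on_finish(int signum) {
  (void)signum;
  finish_requested = 1;
}

/* Hypothetical shared helper: install `handler` for `signum`.
 * Returns 0 on success, -1 on failure (errno is set by sigaction). */
static inline int install_handler(int signum, void (*handler)(int)) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = handler;
  sigemptyset(&sa.sa_mask); /* block no extra signals while the handler runs */
  sa.sa_flags = SA_RESTART; /* restart interrupted system calls where possible */
  return sigaction(signum, &sa, NULL);
}
```

Called as `install_handler(SIGUSR1, on_finish)`, the helper returns 0 on success. Unlike with signal(), the handler stays installed after it fires and the semantics do not vary between platforms.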

Collaborator Author

I think the name is bad; it should be ze_sampling_daemon.
I think that right now --sampling without a ze backend is broken.

Collaborator

It makes sense, as we don't have support for the others yet.

Collaborator

Yes, but we need to move it out of the folder and make it generic at some point (we could want to sample other platform counters). Each sampling backend should be activated by its own environment variable.

Collaborator

@Kerilk I agree that associating the option with one backend is not ideal for the long term. Making it generic and adding support for the others is a good idea.
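
As a rough illustration of the "one environment variable per sampling backend" idea from this thread, a generic daemon could gate each backend on its own variable. This is only a sketch: `sampling_backend_enabled` is a hypothetical helper, and the variable names in the usage note are invented for the example, not names the project uses.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: a backend counts as enabled when its environment
 * variable is set to a non-empty value other than "0". */
static inline int sampling_backend_enabled(const char *env_name) {
  const char *value = getenv(env_name);
  return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}
```

A generic daemon could then check, for instance, `sampling_backend_enabled("EXAMPLE_SAMPLING_ZE")` and `sampling_backend_enabled("EXAMPLE_SAMPLING_OTHER")` independently (both names are made up here), and initialize only the backends that were requested.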

@@ -768,23 +770,24 @@ int main(int argc, char **argv) {
_DL_ERROR_MSG();
return 1;
Collaborator Author

I think this will deadlock, as the host will wait for ready. I didn't try; maybe we are lucky and the non-zero exit code will trigger something, and everybody will be happy and bail.
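
One way to avoid waiting forever, assuming the host waits for a "ready" signal from the daemon, is to bound the wait with sigtimedwait(2). The sketch below is not what the PR does: `block_signal` and `wait_for_signal` are hypothetical helpers, and `SIGUSR1` in the usage note stands in for whatever ready signal the project uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <time.h>

/* Hypothetical helper: block `signum` so it stays pending until waited for.
 * This must happen before the daemon is launched, otherwise the signal
 * could be delivered before the host starts waiting. Returns 0 on success. */
static inline int block_signal(int signum) {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, signum);
  return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Hypothetical helper: wait at most `timeout_sec` seconds for `signum`.
 * Returns 0 if the signal arrived, -1 on timeout or error. */
static inline int wait_for_signal(int signum, time_t timeout_sec) {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, signum);
  struct timespec timeout = {.tv_sec = timeout_sec, .tv_nsec = 0};
  /* sigtimedwait returns the signal number on success, -1 otherwise. */
  return sigtimedwait(&set, NULL, &timeout) == signum ? 0 : -1;
}
```

The host would call `block_signal(SIGUSR1)` before spawning the daemon and then `wait_for_signal(SIGUSR1, 5)`; a return value of -1 means the daemon never reported ready and the host can bail out. A timeout is only one option: checking whether the daemon has already exited, for example with waitpid(2) and `WNOHANG`, is another.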

@TApplencourt
Collaborator Author

@solo2abera do you agree that we can merge this PR? If yes, please approve; if not, please comment :)

I can work on @Kerilk's comment about refactoring all the daemons together, and on how to handle the exit code, in the next PR.

Then I can follow up with the refactoring to handle multiple sampling "engines".

Thanks!

Collaborator

@sbekele81 sbekele81 left a comment

I tested the branch and it is good to merge.

@sbekele81 sbekele81 merged commit a4a0e01 into devel Nov 27, 2024
16 checks passed
@TApplencourt TApplencourt deleted the sampling_refac branch February 13, 2025 18:24
3 participants