Checklist for "stable" landing point #2020

Open
4 of 11 tasks
rhc54 opened this issue Oct 1, 2024 · 3 comments

rhc54 commented Oct 1, 2024

With the project winding down, it is time to define a stable landing point where we can leave it for those wanting to use it. This means:

  • removing all stale code, particularly components that aren't actively used
  • collapsing frameworks into single code directories where multiple variations are not required (e.g., rtc)
  • reducing complexity wherever possible

We'll keep a checklist here as we work through the process, which will culminate in a new PRRTE v4 release series.

Code pruning and correction

  • Remove "likwid" mapper - never implemented
  • Remove "slurm" and "mpich" personalities - never fully implemented nor used
  • Collapse "rtc" framework
  • Collapse "oob" framework - consolidate the messaging system and refactor it
  • Remove "psched" tool - being replaced by external "dynasched" Python project
  • Revamp tool system - replace individual tools (e.g., "pterm") with options to "prte" itself to remove conflicts with other packages. This needs design work, since we must retain "prterun" and "prun" as separate cmds
  • Resolve "permanent" solution to the Slurm plm problem - use new launcher lib if it becomes available, otherwise may need to remove envar support for the internal "srun" cmd line options (see also: Slurm integration #1974)

Enhancements

  • Add PRRTE-internal resiliency support - recover connections to grandparents when parent connection is lost, restore parent connection if/when parent returns, number collective messages to ensure replay when necessary
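The "number collective messages to ensure replay" part of the item above could look roughly like the following. This is a minimal sketch only, not PRRTE code: the class names `CollectiveLog` and `CollectiveReceiver`, the cumulative-ack scheme, and the in-memory log are all assumptions made for illustration.

```python
from collections import OrderedDict


class CollectiveLog:
    """Hypothetical sender side: numbers each message and keeps it until acked."""

    def __init__(self):
        self._next_seq = 0
        self._unacked = OrderedDict()  # seq -> payload, in send order

    def send(self, payload):
        # Assign the next sequence number and remember the message for replay.
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq, payload

    def ack(self, seq):
        # Cumulative ack (an assumption): everything up to `seq` was delivered.
        for done in [s for s in self._unacked if s <= seq]:
            del self._unacked[done]

    def replay(self):
        # After a reconnect, resend every message that was never acknowledged.
        return list(self._unacked.items())


class CollectiveReceiver:
    """Hypothetical receiver side: delivers in order and drops duplicates."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq != self.expected:
            # Duplicate from a replay, or a gap that a later replay will fill.
            return False
        self.delivered.append(payload)
        self.expected += 1
        return True
```

With numbering in place, a daemon that reconnects (to a returning parent or to a grandparent) can replay its unacknowledged messages without the receiver delivering anything twice.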

Scheduler integration

  • Resolve question of moving scheduler integration support into separate branch
  • Complete node extension support for adding nodes on-the-fly
  • Complete session directive support - e.g., session/job preemption

naughtont3 commented Oct 4, 2024

  • Resolve "permanent" solution to the Slurm plm problem - use new launcher lib if it becomes available, otherwise may need to remove envar support for the internal "srun" cmd line options

Quick follow-up after the 3 Oct 2024 teleconf: I was mistaken, and SLURM_VERSION is not exported as an envvar within the allocation. It appears you must go through one of the utilities (e.g., srun --version, scontrol show config | grep SLURM_VERSION).

shell: $ srun --version
slurm 24.05.2
shell: $ scontrol show config | grep SLURM_VERSION
SLURM_VERSION           = 24.05.2
shell: $ echo $SLURM_VERSION

shell: $
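If the version has to come from the utilities, the lookup could be scripted along these lines. This is only a sketch under assumptions: the helper names `parse_slurm_version` and `detect_slurm_version` are made up for illustration, and the parsing relies on nothing more than the two output formats shown in the transcript above.

```python
import re
import shutil
import subprocess

# Matches "slurm 24.05.2" (srun --version) or
# "SLURM_VERSION           = 24.05.2" (scontrol show config).
_VERSION_RE = re.compile(
    r"^\s*(?:slurm\s+|SLURM_VERSION\s*=\s*)(\d+)\.(\d+)(?:\.(\d+))?",
    re.MULTILINE,
)


def parse_slurm_version(text):
    """Return (major, minor, micro) parsed from utility output, or None."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None
    # A missing micro component is treated as 0 (an assumption).
    return tuple(int(part or 0) for part in match.groups())


def detect_slurm_version():
    """Try srun first, then scontrol; return a version tuple or None."""
    for cmd in (["srun", "--version"], ["scontrol", "show", "config"]):
        if shutil.which(cmd[0]) is None:
            continue  # utility not on PATH
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=10, check=True
            )
        except (subprocess.SubprocessError, OSError):
            continue  # utility failed or timed out; try the next one
        version = parse_slurm_version(result.stdout)
        if version is not None:
            return version
    return None
```

The tuple form makes comparisons such as `version >= (24, 5, 0)` straightforward, which is the kind of check a version-dependent workaround would need.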

rhc54 commented Oct 4, 2024

If you just get an allocation (salloc and no srun), is there anything you can see that might give us a hint as to the version, even if it doesn't give us a direct value?

@naughtont3

Unfortunately, I do not see anything that would give an indication (salloc and then env | grep SLURM).
