Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oss517 #356

Merged
merged 216 commits into from
Nov 9, 2023
Merged

Oss517 #356

merged 216 commits into from
Nov 9, 2023

Conversation

ankona
Copy link
Contributor

@ankona ankona commented Sep 7, 2023

No description provided.

@codecov
Copy link

codecov bot commented Sep 25, 2023

Codecov Report

Merging #356 (011434e) into dashboard (266b3c6) will increase coverage by 0.11%.
The diff coverage is 94.59%.

Impacted file tree graph

@@              Coverage Diff              @@
##           dashboard     #356      +/-   ##
=============================================
+ Coverage      90.00%   90.11%   +0.11%     
=============================================
  Files             60       60              
  Lines           3710     3802      +92     
=============================================
+ Hits            3339     3426      +87     
- Misses           371      376       +5     
Files Coverage Δ
smartsim/_core/config/config.py 98.66% <100.00%> (+0.11%) ⬆️
smartsim/_core/launcher/launcher.py 100.00% <ø> (ø)
smartsim/_core/launcher/local/local.py 96.82% <100.00%> (+2.82%) ⬆️
smartsim/_core/launcher/step/mpiStep.py 88.73% <100.00%> (ø)
smartsim/_core/launcher/step/step.py 100.00% <100.00%> (ø)
smartsim/_core/utils/serialize.py 96.96% <100.00%> (-0.26%) ⬇️
smartsim/entity/dbnode.py 91.96% <100.00%> (-0.15%) ⬇️
smartsim/_core/control/controller.py 86.47% <96.55%> (+0.40%) ⬆️
smartsim/_core/control/job.py 93.93% <95.65%> (+0.51%) ⬆️
smartsim/_core/control/jobmanager.py 93.28% <83.33%> (-0.76%) ⬇️
... and 2 more

@ankona ankona force-pushed the oss517 branch 3 times, most recently from 51a2630 to edda5d7 Compare October 19, 2023 14:54
@ankona ankona changed the base branch from develop to dashboard October 19, 2023 19:14
@ankona ankona force-pushed the oss517 branch 4 times, most recently from 9cf4e11 to 197e75e Compare October 24, 2023 23:23
@ankona ankona marked this pull request as ready for review October 26, 2023 01:29
@MattToast
Copy link
Member

One last thing to consider, and for sake of time I'm going to request this be kicked into a new ticket/PR!

It looks like unmanaged tasks need to launched through the indirect.py entry-point. We are successfully launching everything through the local launcher and that seems to be working perfectly! Unfortunately, it looks like we do not have a similar method to tack unamanged steps that may have been launched via a WLM:

For testing consider the following driver script:

from smartsim import Experiment

def main() -> int:
    slurm = "slurm"      # <-- I believe this problem exists for any WLM
    open_mpi = "mpirun"  #     and run command combo that produces an
                         #     unmanaged step
    exp = Experiment("my-exp", launcher=slurm)

    rs = exp.create_run_settings("echo", ["hello", "world"],
                                 run_command=open_mpi)
    rs.set_tasks(1)
    model = exp.create_model("my-model", run_settings=rs)

    exp.generate(model, overwrite=True)
    exp.start(model, block=True)  # <-- Does not look like any start or
                                  #     stop jsons are produced bc the
                                  #     unmanaged indirect is never invoked
    print("Done")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())

While a slurm/mpirun pairing tends to be wayyyy less popular than the usual slurm/srun configuration, it is not uncommon for users use a PBS/mpi{exec,run} pairing, so it may be worth testing that explicitly as well!

@ankona
Copy link
Contributor Author

ankona commented Nov 7, 2023

One last thing to consider, and for sake of time I'm going to request this be kicked into a new ticket/PR!

While a slurm/mpirun pairing tends to be wayyyy less popular than the usual slurm/srun configuration, it is not uncommon for users use a PBS/mpi{exec,run} pairing, so it may be worth testing that explicitly as well!

Added a follow-up ticket to give this the love it deserves.

@MattToast MattToast self-requested a review November 8, 2023 18:55
Copy link
Member

@MattToast MattToast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing and seems to be working great! I left a couple suuuupppeer pedantic change requests and some notes for either you/I to write up some tickets for some known defects that we want to address, but nothing worth holding up approval over!

Otherwise LGTM!! Thanks for all of the hard work on this!!

Comment on lines 567 to +570
def _create_batch_job_step(
self, entity_list: t.Union[Orchestrator, Ensemble, _AnonymousBatchJob]
self,
entity_list: t.Union[Orchestrator, Ensemble, _AnonymousBatchJob],
telemetry_dir: pathlib.Path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same "don't love, but good enough for now" concern with param diving the telemetry_dir: @MattToast and/or @ankona remember to ticket this as well!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ankona ankona merged commit 59ab617 into CrayLabs:dashboard Nov 9, 2023
@ankona ankona deleted the oss517 branch June 3, 2024 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants