This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

feat: replace ppa with the Slurm snap #35

Merged: 3 commits into charmed-hpc:experimental on Jul 18, 2024

Conversation

jedel1043 (Contributor)

Description

Integrates the Slurm snap into the codebase.

How was the code tested?

WIP

Related issues and/or tasks

Link any related issues or project board tasks to this pull request.

Checklist

  • I am the author of these changes, or I have the rights to submit them.
  • I have added the relevant changes to the README and/or documentation.
  • I have self-reviewed my own code.
  • All requested changes and/or review comments have been resolved.

@jedel1043 jedel1043 changed the base branch from main to experimental July 11, 2024 20:50
@NucciTheBoss (Member)

Hmm... I'm having some trouble with getting NHC to install properly 😢

unit-compute-2: 16:07:37 INFO unit.compute/2.juju-log Running legacy hooks/install.
unit-compute-2: 16:07:57 INFO unit.compute/2.juju-log #### Installing NHC
unit-compute-2: 16:07:57 WARNING unit.compute/2.install tar: lbnl-nhc-1.4.3.tar.gz: Cannot open: No such file or directory
unit-compute-2: 16:07:57 WARNING unit.compute/2.install tar: Error is not recoverable: exiting now
unit-compute-2: 16:07:57 ERROR unit.compute/2.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-compute-2/charm/./src/charm.py", line 340, in <module>
    main.main(SlurmdCharm)
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/main.py", line 548, in main
    manager.run()
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/main.py", line 527, in run
    self._emit()
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/main.py", line 516, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-compute-2/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-compute-2/charm/./src/charm.py", line 83, in _on_install
    if self._slurmd_manager.install():
  File "/var/lib/juju/agents/unit-compute-2/charm/src/slurmd_ops.py", line 58, in install
    if not self._install_nhc_from_tarball():
  File "/var/lib/juju/agents/unit-compute-2/charm/src/slurmd_ops.py", line 93, in _install_nhc_from_tarball
    full_path = base_path / os.listdir(base_path)[0]
IndexError: list index out of range

Looks like the NHC tarball is not being properly staged, which is causing tar to fail.

@NucciTheBoss NucciTheBoss self-requested a review July 13, 2024 01:57
@NucciTheBoss (Member)

Solved the NHC issue I was seeing in #37. It looks like the NHC tarball currently isn't being properly primed within the charm 😅

Needed to add the part `charm: {}` to make the charm pack correctly.
jedel1043 and I found that this part declaration must be included in
`charmcraft.yaml`, or charmcraft will not pack the charm correctly:
nhc will be there, but the charm won't :'(

Signed-off-by: Jason C. Nucciarone <[email protected]>
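
For context, a minimal sketch of what that part declaration could look like in `charmcraft.yaml`. Only the `charm: {}` part itself comes from the commit message above; any other parts (for example, whatever stages the NHC tarball) are assumptions and are not shown:

```yaml
# Hedged sketch: only the `charm: {}` part declaration is taken from the
# commit message above; other parts are assumed and omitted.
parts:
  charm: {}  # without this explicit part, charmcraft primes the other parts
             # but leaves the charm code itself out of the packed artifact
```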
Previously, if tar failed to extract the contents of the nhc
tarball to `/tmp/nhc`, _install_nhc_from_tarball would throw
an unhandled exception that would cause the charm to bork.

Now we catch if tar fails to extract the contents of the tarball,
log the error output for the administrator to read, and return False
so that the slurmd operator can properly handle an install failure.

Signed-off-by: Jason C. Nucciarone <[email protected]>
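
A minimal sketch of the error handling described above, assuming a hypothetical standalone `_install_nhc_from_tarball` helper that shells out to `tar`. The real method lives in `src/slurmd_ops.py`; names, paths, and the tarball filename here are illustrative only:

```python
import logging
import subprocess
from pathlib import Path

_logger = logging.getLogger(__name__)


def _install_nhc_from_tarball(tarball: Path = Path("lbnl-nhc-1.4.3.tar.gz")) -> bool:
    """Extract the NHC tarball into /tmp/nhc, returning False on failure.

    Hypothetical sketch of the behavior described in the commit message;
    the real implementation in src/slurmd_ops.py may differ.
    """
    base_path = Path("/tmp/nhc")
    base_path.mkdir(parents=True, exist_ok=True)
    try:
        subprocess.run(
            ["tar", "--extract", "--file", str(tarball), "--directory", str(base_path)],
            check=True,
            capture_output=True,
            text=True,
        )
    except subprocess.CalledProcessError as e:
        # Log tar's stderr so the administrator can see why extraction failed,
        # then return False so the slurmd operator can handle the install failure.
        _logger.error("failed to extract %s: %s", tarball, e.stderr)
        return False

    if not any(base_path.iterdir()):
        # Guard against the empty-directory case that previously raised IndexError.
        _logger.error("%s extracted no files into %s", tarball, base_path)
        return False
    return True
```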
@NucciTheBoss (Member) left a comment


Looks good! Just two small comments and I'm ready to merge. I already tested your branch locally against the changes I made to slurmctld experimental, so we'll worry about further polishing as we integrate COS into the charms.

Let me know if you have any questions 🔎

This also transitions our systemd notices to a new version, since the
old version doesn't support services with dots in their names.
@jedel1043 jedel1043 marked this pull request as ready for review July 16, 2024 23:30
@jedel1043 jedel1043 requested a review from NucciTheBoss July 16, 2024 23:30
@NucciTheBoss (Member) left a comment


LGTM - we can land as is and start focusing on integrating COS into the slurmctld operator.

canonical/operator-libs-linux#128 is close to landing, so we won't have to carry an out-of-tree version of juju-systemd-notices for too long.

@NucciTheBoss NucciTheBoss merged commit fcdd971 into charmed-hpc:experimental Jul 18, 2024
4 of 5 checks passed
@jedel1043 jedel1043 deleted the slurm-snap branch July 18, 2024 20:23