Commit

Remove Cobalt support (#448)
As we are not aware of any system still using the Cobalt workload manager, its support in SmartSim was terminated.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast @ashao ]
al-rigazzi authored Jan 19, 2024
1 parent f683521 commit e107932
Showing 50 changed files with 91 additions and 1,015 deletions.
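For scripts that still pass `launcher="cobalt"`, a small guard can surface the removal early. The sketch below is illustrative only: `SUPPORTED_LAUNCHERS`, `REMOVED_LAUNCHERS`, and `check_launcher` are hypothetical names, not part of SmartSim. The accepted values are taken from the updated `doc/launchers.rst` in this commit and may be incomplete (the test configuration also mentions `pals`).

```python
# Hypothetical helper, not part of SmartSim: fail fast on removed launchers.
# The accepted names mirror the list in doc/launchers.rst after this commit.
SUPPORTED_LAUNCHERS = ("local", "slurm", "pbs", "lsf", "auto")
REMOVED_LAUNCHERS = {"cobalt": "Cobalt support was removed in SmartSim PR #448"}


def check_launcher(name: str) -> str:
    """Return the normalized launcher name or raise ValueError."""
    normalized = name.strip().lower()
    if normalized in REMOVED_LAUNCHERS:
        raise ValueError(f"{normalized!r}: {REMOVED_LAUNCHERS[normalized]}")
    if normalized not in SUPPORTED_LAUNCHERS:
        raise ValueError(
            f"Unknown launcher {normalized!r}; expected one of {SUPPORTED_LAUNCHERS}"
        )
    return normalized
```

Calling `check_launcher` before constructing an `Experiment` would turn a stale `"cobalt"` setting into an explicit error message rather than a failure deeper in the launch path.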
3 changes: 1 addition & 2 deletions .wci.yml
@@ -10,7 +10,7 @@
Machine Learning (ML) libraries, like PyTorch and TensorFlow,
in combination with High Performance Computing (HPC) simulations and applications.
SmartSim launches ML infrastructure on HPC systems alongside user workloads
and supports most HPC workload managers (e.g. Slurm, PBSPro, LSF, Cobalt).
and supports most HPC workload managers (e.g. Slurm, PBSPro, LSF).
SmartSim also provides a set of client libraries in Python, C++, C, and Fortran.
These client libraries allow users to send and receive data between user
applications and the machine learning infrastructure. Moreover, the
@@ -41,7 +41,6 @@
- Slurm
- PBSPro
- LSF
- Cobalt
- Linux/MacOS
transfer_protocols:
- TCP/IP
7 changes: 3 additions & 4 deletions README.md
@@ -179,7 +179,6 @@ launch capabilities for all applications.
- Slurm
- LSF
- PBSPro
- Cobalt
- Local (for laptops/single node, no batch)


@@ -198,7 +197,7 @@ qsub -l select=3:ncpus=20 -l walltime=00:10:00 -l place=scatter -I -q <queue>
bsub -Is -W 00:10 -nnodes 3 -P <project> $SHELL
```

This same script will run on a SLURM, PBS, LSF, or Cobalt system as the ``launcher``
This same script will run on a SLURM, PBS, or LSF system as the ``launcher``
is set to `auto` in the [Experiment](https://www.craylabs.org/docs/api/smartsim_api.html#experiment)
initialization. The run command like ``mpirun``,
``aprun`` or ``srun`` will be automatically detected from what is available on the
@@ -277,8 +276,8 @@ print(exp.get_status(ensemble))
python hello_ensemble.py
```

Similar to the interactive example, this same script will run on a SLURM, PBS, LSF,
or Cobalt system as the ``launcher`` is set to `auto` in the
Similar to the interactive example, this same script will run on a SLURM, PBS,
or LSF system as the ``launcher`` is set to `auto` in the
[Experiment](https://www.craylabs.org/docs/api/smartsim_api.html#experiment)
initialization. Local launching does not support batch workloads.

38 changes: 4 additions & 34 deletions conftest.py
@@ -101,7 +101,7 @@ def print_test_configuration() -> None:

def pytest_configure() -> None:
pytest.test_launcher = test_launcher
pytest.wlm_options = ["slurm", "pbs", "cobalt", "lsf", "pals"]
pytest.wlm_options = ["slurm", "pbs", "lsf", "pals"]
account = get_account()
pytest.test_account = account
pytest.test_device = test_device
@@ -153,12 +153,7 @@ def kill_all_test_spawned_processes() -> None:
def get_hostlist() -> t.Optional[t.List[str]]:
global test_hostlist
if not test_hostlist:
if "COBALT_NODEFILE" in os.environ:
try:
return _parse_hostlist_file(os.environ["COBALT_NODEFILE"])
except FileNotFoundError:
return None
elif "PBS_NODEFILE" in os.environ and test_launcher == "pals":
if "PBS_NODEFILE" in os.environ and test_launcher == "pals":
# with PALS, we need a hostfile even if `aprun` is available
try:
return _parse_hostlist_file(os.environ["PBS_NODEFILE"])
@@ -269,27 +264,14 @@ def get_base_run_settings(
run_args = {"--np": ntasks, "--hostfile": host_file}
run_args.update(kwargs)
return RunSettings(exe, args, run_command="mpiexec", run_args=run_args)
if test_launcher == "cobalt":
if shutil.which("aprun"):
run_command = "aprun"
run_args = {"--pes": ntasks}
else:
run_command = "mpirun"
host_file = os.environ["COBALT_NODEFILE"]
run_args = {"-n": ntasks, "--hostfile": host_file}
run_args.update(kwargs)
settings = RunSettings(
exe, args, run_command=run_command, run_args=run_args
)
return settings
if test_launcher == "lsf":
run_args = {"--np": ntasks, "--nrs": nodes}
run_args.update(kwargs)
settings = RunSettings(exe, args, run_command="jsrun", run_args=run_args)
return settings
if test_launcher != "local":
raise SSConfigError(
"Base run settings are available for Slurm, PBS, Cobalt, "
"Base run settings are available for Slurm, PBS, "
f"and LSF, but launcher was {test_launcher}"
)
# TODO allow user to pick aprun vs MPIrun
@@ -320,18 +302,6 @@ def get_run_settings(
run_args = {"np": ntasks, "hostfile": host_file}
run_args.update(kwargs)
return PalsMpiexecSettings(exe, args, run_args=run_args)
# TODO allow user to pick aprun vs MPIrun
if test_launcher == "cobalt":
if shutil.which("aprun"):
run_args = {"pes": ntasks}
run_args.update(kwargs)
return AprunSettings(exe, args, run_args=run_args)

host_file = os.environ["COBALT_NODEFILE"]
run_args = {"n": ntasks, "hostfile": host_file}
run_args.update(kwargs)
return MpirunSettings(exe, args, run_args=run_args)

if test_launcher == "lsf":
run_args = {
"nrs": nodes,
@@ -344,7 +314,7 @@ def get_run_settings(

@staticmethod
def get_orchestrator(nodes: int = 1, batch: bool = False) -> Orchestrator:
if test_launcher in ["pbs", "cobalt"]:
if test_launcher == "pbs":
if not shutil.which("aprun"):
hostlist = get_hostlist()
else:
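With the `COBALT_NODEFILE` branch gone, the only node-file lookup visible in the `get_hostlist` hunk is the `PBS_NODEFILE` case used with PALS. The sketch below restates that branch as standalone code. It is an approximation under stated assumptions: `parse_hostlist_file` and `hostlist_from_env` are hypothetical stand-ins (the body of `_parse_hostlist_file` is not shown in this diff), the de-duplication step is an assumption, and the branches elided from the hunk are not reproduced.

```python
import os
import typing as t


def parse_hostlist_file(path: str) -> t.List[str]:
    """Return unique host names from a node file, preserving order.

    Hypothetical stand-in for conftest's _parse_hostlist_file.
    """
    with open(path, "r", encoding="utf-8") as nodefile:
        hosts = [line.strip() for line in nodefile if line.strip()]
    # dict.fromkeys keeps first-seen order while dropping repeated hosts
    return list(dict.fromkeys(hosts))


def hostlist_from_env(launcher: str) -> t.Optional[t.List[str]]:
    """Consult PBS_NODEFILE only when the test launcher is PALS."""
    if "PBS_NODEFILE" in os.environ and launcher == "pals":
        try:
            return parse_hostlist_file(os.environ["PBS_NODEFILE"])
        except FileNotFoundError:
            # Assumption: a missing node file is treated as "no host list"
            return None
    return None
```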
39 changes: 6 additions & 33 deletions doc/api/smartsim_api.rst
@@ -43,8 +43,8 @@ Settings are provided to ``Model`` and ``Ensemble`` objects
to provide parameters for how a job should be executed. Some
are specifically meant for certain launchers like ``SbatchSettings``
is solely meant for system using Slurm as a workload manager.
``MpirunSettings`` for OpenMPI based jobs is supported by Slurm,
PBSPro, and Cobalt.
``MpirunSettings`` for OpenMPI based jobs is supported by Slurm
and PBSPro.


Types of Settings:
@@ -60,7 +60,6 @@ Types of Settings:
JsrunSettings
SbatchSettings
QsubBatchSettings
CobaltBatchSettings
BsubBatchSettings

Settings objects can accept a container object that defines a container
@@ -137,7 +136,7 @@ AprunSettings

``AprunSettings`` can be used on any system that supports the
Cray ALPS layer. SmartSim supports using ``AprunSettings``
on PBSPro and Cobalt WLM systems.
on PBSPro WLM systems.

``AprunSettings`` can be used in interactive session (on allocation)
and within batch launches (e.g., ``QsubBatchSettings``)
@@ -204,7 +203,7 @@ MpirunSettings


``MpirunSettings`` are for launching with OpenMPI. ``MpirunSettings`` are
supported on Slurm, PBSpro, and Cobalt.
supported on Slurm and PBSpro.


.. autosummary::
@@ -231,7 +230,7 @@ MpiexecSettings


``MpiexecSettings`` are for launching with OpenMPI's ``mpiexec``. ``MpirunSettings`` are
supported on Slurm, PBSpro, and Cobalt.
supported on Slurm and PBSpro.


.. autosummary::
@@ -258,7 +257,7 @@ OrterunSettings


``OrterunSettings`` are for launching with OpenMPI's ``orterun``. ``OrterunSettings`` are
supported on Slurm, PBSpro, and Cobalt.
supported on Slurm and PBSpro.


.. autosummary::
@@ -336,32 +335,6 @@ be launched as a batch on PBSPro systems.
:members:


.. _cqsub_api:


CobaltBatchSettings
-------------------

``CobaltBatchSettings`` are used to configure jobs that should
be launched as a batch on Cobalt Systems. They closely mimic
that of the ``QsubBatchSettings`` for PBSPro.


.. autosummary::

CobaltBatchSettings.set_account
CobaltBatchSettings.set_batch_command
CobaltBatchSettings.set_nodes
CobaltBatchSettings.set_queue
CobaltBatchSettings.set_walltime
CobaltBatchSettings.format_batch_args

.. autoclass:: CobaltBatchSettings
:inherited-members:
:undoc-members:
:members:


.. _bsub_api:

BsubBatchSettings
7 changes: 6 additions & 1 deletion doc/changelog.rst
@@ -19,12 +19,16 @@ To be released at some future point in time

Description

- Drop Cobalt support
- Override the sphinx-tabs extension background color
- Updated SmartSim's machine learning backends
- Added ONNX support for Python 3.10

Detailed Notes

- As the Cobalt workload manager is not used on any system we are aware of,
its support in SmartSim was terminated and classes such as `CobaltLauncher` have
been removed. (SmartSim-PR448_)
- The sphinx-tabs documentation extension uses a white background for the tabs component.
A custom CSS for those components to inherit the overall theme color has
been added. (SmartSim-PR453_)
@@ -34,6 +38,7 @@ Detailed Notes
(SmartSim-PR451_)


.. _SmartSim-PR448: https://github.com/CrayLabs/SmartSim/pull/448
.. _SmartSim-PR451: https://github.com/CrayLabs/SmartSim/pull/451
.. _SmartSim-PR453: https://github.com/CrayLabs/SmartSim/pull/453

@@ -454,7 +459,7 @@ Expand Machine Learning Library Support:

Expand Launcher Setting Options:

- Add ability to use base ``RunSettings`` on a Slurm, PBS, or Cobalt launchers (SmartSim-PR90_)
- Add ability to use base ``RunSettings`` on a Slurm, or PBS launchers (SmartSim-PR90_)
- Add ability to use base ``RunSettings`` on LFS launcher (SmartSim-PR108_)

Deprecations and Breaking Changes
10 changes: 3 additions & 7 deletions doc/developer.rst
@@ -84,14 +84,14 @@ Local
=====

There are two levels of testing in SmartSim. The first runs by default and does
not launch any jobs out onto a system through a workload manager like Cobalt.
not launch any jobs out onto a system through a workload manager like Slurm.

If any of the above commands are used, the test suite will run the "light" test
suite by default.


PBSPro, Slurm, Cobalt, LSF
==========================
PBSPro, Slurm, LSF
==================

To run the full test suite, users will have to be on a system with one of the
above workload managers. Additionally, users will need to obtain an allocation
@@ -105,9 +105,6 @@ of at least 3 nodes.
# for PBSPro (with aprun)
qsub -l select=3 -l place=scatter -l walltime=00:10:00 -q queue
# for Cobalt (with aprun)
qsub -n 3 -t 00:10:00 -A account -q queue -I
# for LSF (with jsrun)
bsub -Is -W 00:30 -nnodes 3 -P project $SHELL
@@ -117,7 +114,6 @@ Once in an iterative allocation, users will need to set the test launcher
environment variable: ``SMARTSIM_TEST_LAUNCHER`` to one of the following values

- slurm
- cobalt
- pbs
- lsf
- local
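The list above is what a test harness would accept for `SMARTSIM_TEST_LAUNCHER` once `cobalt` is dropped. A minimal sketch of reading and validating that variable follows. The names `VALID_TEST_LAUNCHERS` and `read_test_launcher` are hypothetical, and falling back to `local` when the variable is unset is an assumption rather than something this diff shows.

```python
import os
import typing as t

# Values listed in doc/developer.rst after this commit ("cobalt" removed).
VALID_TEST_LAUNCHERS = ("slurm", "pbs", "lsf", "local")


def read_test_launcher(env: t.Optional[t.Mapping[str, str]] = None) -> str:
    """Read SMARTSIM_TEST_LAUNCHER and reject values outside the documented list."""
    env = os.environ if env is None else env
    # Assumption: an unset variable means the local launcher
    value = env.get("SMARTSIM_TEST_LAUNCHER", "local").strip().lower()
    if value not in VALID_TEST_LAUNCHERS:
        raise ValueError(
            f"SMARTSIM_TEST_LAUNCHER={value!r} is not one of {VALID_TEST_LAUNCHERS}"
        )
    return value
```

Passing a plain dictionary as `env` keeps the check easy to exercise without touching the real process environment.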
5 changes: 2 additions & 3 deletions doc/experiment.rst
@@ -38,8 +38,8 @@ available compute resources on the system.
Each launcher supports specific types of ``RunSettings``.

- :ref:`SrunSettings <srun_api>` for Slurm
- :ref:`AprunSettings <aprun_api>` for PBSPro and Cobalt
- :ref:`MpirunSettings <openmpi_run_api>` for OpenMPI with `mpirun` on PBSPro, Cobalt, LSF, and Slurm
- :ref:`AprunSettings <aprun_api>` for PBSPro
- :ref:`MpirunSettings <openmpi_run_api>` for OpenMPI with `mpirun` on PBSPro, LSF, and Slurm
- :ref:`JsrunSettings <jsrun_api>` for LSF

These settings can be manually specified by the user, or auto-detected by the
@@ -181,7 +181,6 @@ workload manager and available compute resources.

- :ref:`SbatchSettings <sbatch_api>` for Slurm
- :ref:`QsubBatchSettings <qsub_api>` for PBSPro
- :ref:`CobaltBatchSettings <cqsub_api>` for Cobalt
- :ref:`BsubBatchSettings <bsub_api>` for LSF

If it only passed ``RunSettings``, ``Ensemble``, objects will require either
40 changes: 3 additions & 37 deletions doc/launchers.rst
@@ -16,9 +16,8 @@ SmartSim currently supports 5 `launchers`:
1. ``local``: for single-node, workstation, or laptop
2. ``slurm``: for systems using the Slurm scheduler
3. ``pbs``: for systems using the PBSpro scheduler
4. ``cobalt``: for systems using the Cobalt scheduler
5. ``lsf``: for systems using the LSF scheduler
6. ``auto``: have SmartSim auto-detect the launcher to use.
4. ``lsf``: for systems using the LSF scheduler
5. ``auto``: have SmartSim auto-detect the launcher to use.

To specify a specific launcher, one argument needs to be provided
to the ``Experiment`` initialization.
@@ -30,7 +29,6 @@ to the ``Experiment`` initialization.
exp = Experiment("name-of-experiment", launcher="local") # local launcher
exp = Experiment("name-of-experiment", launcher="slurm") # Slurm launcher
exp = Experiment("name-of-experiment", launcher="pbs") # PBSpro launcher
exp = Experiment("name-of-experiment", launcher="cobalt") # Cobalt launcher
exp = Experiment("name-of-experiment", launcher="lsf") # LSF launcher
exp = Experiment("name-of-experiment", launcher="auto") # auto-detect launcher
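The `auto` value leaves launcher selection to SmartSim. To illustrate the general idea of such detection, the sketch below probes for scheduler commands on `PATH`. This is not SmartSim's actual detection logic: `detect_launcher`, the probe order, and the command chosen for each scheduler are all assumptions made for the example. One point that does follow from this diff is that both the PBSPro and Cobalt allocation examples use `qsub`, so with Cobalt removed a `qsub` hit no longer needs disambiguating.

```python
import shutil
import typing as t

WhichFn = t.Callable[[str], t.Optional[str]]


def detect_launcher(which: WhichFn = shutil.which) -> str:
    """Guess a launcher name from the scheduler commands found on PATH.

    Hypothetical sketch; the probe table is an assumption, not SmartSim's.
    """
    probes = (("slurm", "sbatch"), ("pbs", "qsub"), ("lsf", "bsub"))
    for launcher, command in probes:
        if which(command):
            return launcher
    # No scheduler command found: fall back to single-node execution
    return "local"
```

The `which` parameter is injectable so the function can be checked on a machine with no scheduler installed.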
@@ -219,42 +217,10 @@ creation.

---------------------------------------------------------------------

Cobalt
======

The Cobalt Launcher works just like the PBSPro launcher and
is compatible with ALPS and OpenMPI workloads as well.

To use the Cobalt launcher, specify at ``Experiment`` initialization:

.. code-block:: python
from smartsim import Experiment
exp = Experiment("MOM6-double-gyre", launcher="cobalt")
Running on Cobalt
-----------------

The Cobalt launcher supports three types of ``RunSettings``:
1. :ref:`AprunSettings <aprun_api>`
2. :ref:`MpirunSettings <openmpi_run_api>`
3. :ref:`MpiexecSettings <openmpi_exec_api>`

As well as batch settings for ``qsub`` through:
1. :ref:`CobaltBatchSettings <cqsub_api>`

Both supported ``RunSettings`` types above can be added
to a ``CobaltBatchSettings`` batch workload through ``Ensemble``
creation.

---------------------------------------------------------------------

LSF
===

The LSF Launcher works like the PBSPro and Cobalt launchers and
The LSF Launcher works like the PBSPro launcher and
is compatible with LSF and OpenMPI workloads.

To use the LSF launcher, specify at ``Experiment`` initialization:
3 changes: 1 addition & 2 deletions doc/overview.rst
@@ -61,8 +61,7 @@ The key features of the IL are:
- An API to start, monitor, and stop HPC jobs from Python or from a Jupyter notebook.
- Automated deployment of in-memory data staging (`Redis <https://redis.io>`_) and computational
storage (`RedisAI <https://redisai.io>`_).
- Programmatic launches of batch and in-allocation jobs on PBS, Slurm, LSF,
and Cobalt systems.
- Programmatic launches of batch and in-allocation jobs on PBS, Slurm, and LSF systems.
- Creating and configuring ensembles of workloads with isolated communication channels.

The IL can configure and launch batch jobs as well as jobs within interactive
5 changes: 0 additions & 5 deletions doc/testing.rst
@@ -78,9 +78,6 @@ Examples of how to obtain allocations on systems with the launchers:
# for PBSPro (with aprun)
qsub -l select=4 -l place=scatter -l walltime=00:10:00 -q queue
# for Cobalt (with aprun)
qsub -n 4 -t 00:10:00 -A account -q queue -I
# for LSF (with jsrun)
bsub -Is -W 00:30 -nnodes 4 -P project $SHELL
@@ -91,7 +88,6 @@ launcher environment variable: ``SMARTSIM_TEST_LAUNCHER`` to one
of the following values

- slurm
- cobalt
- pbs
- lsf
- local
@@ -273,4 +269,3 @@ The actions are defined using yaml files are are located in the
Each pull request, push and merge the test suite for SmartRedis
and SmartSim are run. For SmartSim, this is the ``local`` test suite
with the local launcher.
