
Merge pull request #44 from lsst/tickets/DM-42579
DM-42579: Add call to allocateNodes inside bps
mxk62 authored Sep 25, 2024
2 parents fe78af9 + 058711b commit c48eab7
Showing 8 changed files with 781 additions and 115 deletions.
1 change: 1 addition & 0 deletions doc/changes/DM-42579.feature.rst
@@ -0,0 +1 @@
Added ability for the plugin to call ``allocateNodes.py`` during workflow execution in order to manage required computational resources automatically.
160 changes: 155 additions & 5 deletions doc/lsst.ctrl.bps.htcondor/userguide.rst
@@ -81,11 +81,6 @@ available in your BPS configuration file. For example:
requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
+JOB_NODE_SET: '"${NODESET}"'
.. note::

   Package `ctrl_execute`_ is not part of the `lsst_distrib`_ metapackage;
   it (as well as its dependencies) needs to be installed manually.

.. __: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#bps-configuration-file
.. __: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#supported-settings

@@ -194,6 +189,138 @@ initially run with 2 GB of memory and failed because of exceeding the limit,
result the entire workflow fails again due to other reasons, the job will ask
for 2 GB of memory during the first execution after the workflow is restarted.
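
The scaling behavior described above is driven by configuration settings along
the lines of the sketch below (the values are illustrative, not
recommendations; ``requestMemory`` is the initial per-job request in MB):

.. code-block:: yaml

   requestMemory: 2048      # initial memory request per job (MB)
   memoryMultiplier: 2.0    # grow the request after a memory-related failure
   memoryLimit: 491520      # upper bound on any automatic increase (MB)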

.. _htc-plugin-provisioning:

Provisioning resources automatically
------------------------------------

Computational resources required to execute a workflow may not always be
managed directly by HTCondor and may need to be provisioned first by a
different workload manager, for example, `Slurm`_. In such a case,
**ctrl_bps_htcondor** can be instructed to run a provisioning job alongside the
workflow that will first create and then maintain the `glideins`__ necessary
for executing the workflow.

This provisioning job is called ``provisioning_job.bash`` and is managed by
HTCondor. Be careful not to remove it by accident when using the ``condor_rm``
or ``kill`` commands. The job runs on a best-effort basis and will not be
restarted automatically once deleted.

To enable automatic provisioning of the resources, add the following settings to
your BPS configuration:

.. code-block:: yaml

   provisionResources: true
   provisioning:
     provisioningMaxWallTime: <value>

where ``<value>`` is the approximate time your workflow needs to complete,
e.g., 3600, 10:00:00.

This will instruct **ctrl_bps_htcondor** to include a service job in the
workflow that runs alongside the payload jobs and automatically creates and
maintains the glideins required for the payload jobs to run.

If you enable automatic provisioning of resources, you will see the status of
the provisioning job in the output of the ``bps report --id <id>`` command.
Look for the line starting with "Provisioning job status". For example:

.. code-block:: bash
   :emphasize-lines: 8

     X STATE    %S ID    OPERATOR PROJECT CAMPAIGN PAYLOAD RUN
   --- ------- --- ----- -------- ------- -------- ------- ---------------------------------------
       RUNNING   0 1.0   jdoe     dev     quick    pcheck  u_jdoe_pipelines_check_20240924T201447Z


   Path: /home/jdoe/submit/u/jdoe/pipelines_check/20240924T201447Z
   Global job id: node001#1.0#1727208891
   Provisioning job status: RUNNING

                     UNKNOWN MISFIT UNREADY READY PENDING RUNNING DELETED HELD SUCCEEDED FAILED PRUNED EXPECTED
   ----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
   TOTAL                   0      0       4     0       1       0       0    0         0      0      0        5
   ----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
   pipetaskInit            0      0       0     0       1       0       0    0         0      0      0        1
   isr                     0      0       1     0       0       0       0    0         0      0      0        1
   characterizeImage       0      0       1     0       0       0       0    0         0      0      0        1
   calibrate               0      0       1     0       0       0       0    0         0      0      0        1
   finalJob                0      0       1     0       0       0       0    0         0      0      0        1

The service job managing the glideins will be automatically canceled once the
workflow is completed. However, the existing glideins will be left for
HTCondor to shut down once they remain inactive for the period specified by
``provisioningMaxIdleTime`` (default: 900 s, see below) or once their maximum
wall time is reached.
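
These two limits can be tuned per submission; for example, a sketch that keeps
glideins around longer between payload jobs (the values are illustrative):

.. code-block:: yaml

   provisionResources: true
   provisioning:
     provisioningMaxWallTime: 12:00:00
     provisioningMaxIdleTime: 1800    # shut down glideins idle for 30 minutes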

If automatic provisioning of resources is enabled, the script that the service
job runs in order to provide the required resources *must be* defined by the
``provisioningScript`` setting in the ``provisioning`` section of your BPS
configuration file. By default, **ctrl_bps_htcondor** will use
``allocateNodes.py`` from the `ctrl_execute`_ package with the following
settings:

.. code-block:: yaml

   provisioning:
     provisioningNodeCount: 10
     provisioningMaxIdleTime: 900
     provisioningCheckInterval: 600
     provisioningQueue: "milano"
     provisioningAccountingUser: "rubin:developers"
     provisioningExtraOptions: ""
     provisioningPlatform: "s3df"
     provisioningScript: |
       #!/bin/bash
       set -e
       set -x
       while true; do
           ${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
               --account {provisioningAccountingUser} \
               --auto \
               --node-count {provisioningNodeCount} \
               --maximum-wall-clock {provisioningMaxWallTime} \
               --glidein-shutdown {provisioningMaxIdleTime} \
               --queue {provisioningQueue} \
               {provisioningExtraOptions} \
               {provisioningPlatform}
           sleep {provisioningCheckInterval}
       done
       exit 0

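If your site cannot use ``allocateNodes.py``, you can point
``provisioningScript`` at a script of your own instead. A minimal sketch is
shown below; ``request_glideins`` is a hypothetical placeholder for a
site-specific command, not part of any LSST package:

.. code-block:: yaml

   provisioning:
     provisioningScript: |
       #!/bin/bash
       # Hypothetical example: periodically (re)request glideins through a
       # site-specific command ('request_glideins' is a placeholder).
       while true; do
           request_glideins --nodes {provisioningNodeCount} \
               --walltime {provisioningMaxWallTime}
           sleep {provisioningCheckInterval}
       done
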
``allocateNodes.py`` requires a small configuration file in the user's home
directory in order to work. With automatic provisioning enabled,
**ctrl_bps_htcondor** will create this file at the location defined by
``provisioningScriptConfigPath`` if it does not already exist, using the
template defined by the ``provisioningScriptConfig`` setting in the
``provisioning`` section:

.. code-block:: yaml

   provisioning:
     provisioningScriptConfig: |
       config.platform["{provisioningPlatform}"].user.name="${USER}"
       config.platform["{provisioningPlatform}"].user.home="${HOME}"
     provisioningScriptConfigPath: "${HOME}/.lsst/condor-info.py"

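For illustration, with the default platform ``s3df`` and a hypothetical user
``jdoe`` whose home directory is ``/home/jdoe``, the generated
``condor-info.py`` would contain something like:

.. code-block::

   config.platform["s3df"].user.name="jdoe"
   config.platform["s3df"].user.home="/home/jdoe"
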
If you're using a custom provisioning script that does not require any
external configuration, set ``provisioningScriptConfig`` to an empty string.

If the file already exists, it will be used as is (BPS will not update it with
config settings). If you wish BPS to overwrite the file with the
``provisioningScriptConfig`` values, you need to manually remove or rename the
existing file.
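
For example, to have BPS regenerate the file at the next submission, you could
rename the existing one first (the path shown is the default
``provisioningScriptConfigPath``):

.. code-block:: bash

   mv ${HOME}/.lsst/condor-info.py ${HOME}/.lsst/condor-info.py.bak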

.. note::

   ``${CTRL_BPS_HTCONDOR_DIR}/python/lsst/ctrl/bps/htcondor/etc/htcondor_defaults.yaml``
   contains default values that are automatically included in the configuration
   of every bps submission made with the ``ctrl_bps_htcondor`` plugin.

.. __: https://htcondor.readthedocs.io/en/latest/codes-other-values/glossary.html#term-Glidein

.. _htc-plugin-troubleshooting:

Troubleshooting
@@ -210,8 +337,31 @@ Why did my submission fail?
Check the ``*.dag.dagman.out`` file in the run submit directory for errors, in
particular for ``ERROR: submit attempt failed``.

I enabled automatic provisioning, but my jobs still sit idle in the queue!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The service job responsible for executing the provisioning script runs on a
best-effort basis. If this job fails to submit correctly or crashes during
workflow execution, this will not be reported as an error and the workflow
will continue normally until the existing glideins expire. As a result,
payload jobs may get stuck in the job queue if the glideins were never created
or expired before the workflow execution could complete.

First, use ``bps report --id <run id>`` to display the run report and look
for the line

.. code-block::

   Provisioning job status: <status>

If the ``<status>`` is anything other than RUNNING, automatic provisioning is
not working. In such a case, create `glideins manually`__ to complete your
run.

.. __: https://developer.lsst.io/usdf/batch.html#ctrl-bps-htcondor
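
For reference, a manual invocation mirroring the options used by the default
provisioning script might look like the sketch below (the account, node count,
wall time, queue, and platform are illustrative; adjust them to your site):

.. code-block:: bash

   ${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
       --account rubin:developers \
       --auto \
       --node-count 4 \
       --maximum-wall-clock 4:00:00 \
       --glidein-shutdown 900 \
       --queue milano \
       s3df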

.. _HTCondor: https://htcondor.readthedocs.io/en/latest/
.. _Slurm: https://slurm.schedmd.com/overview.html
.. _bps cancel: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#canceling-submitted-jobs
.. _bps report: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#checking-status
.. _bps restart: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#restarting-a-failed-run
Empty file.
34 changes: 34 additions & 0 deletions python/lsst/ctrl/bps/htcondor/etc/htcondor_defaults.yaml
@@ -4,3 +4,37 @@
#
# https://s3df.slac.stanford.edu/public/doc/#/batch-compute
memoryLimit: 491520

# Define how to provision resources automatically.
provisioning:
  provisioningNodeCount: 10
  provisioningMaxIdleTime: 900
  provisioningCheckInterval: 600
  provisioningQueue: "milano"
  provisioningAccountingUser: "rubin:developers"
  provisioningExtraOptions: ""
  provisioningPlatform: "s3df"
  provisioningScript: |
    #!/bin/bash
    set -e
    set -x
    while true; do
        ${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
            --account {provisioningAccountingUser} \
            --auto \
            --node-count {provisioningNodeCount} \
            --maximum-wall-clock {provisioningMaxWallTime} \
            --glidein-shutdown {provisioningMaxIdleTime} \
            --queue {provisioningQueue} \
            {provisioningExtraOptions} \
            {provisioningPlatform}
        sleep {provisioningCheckInterval}
    done
    exit 0
  provisioningScriptConfig: |
    config.platform["{provisioningPlatform}"].user.name="${USER}"
    config.platform["{provisioningPlatform}"].user.home="${HOME}"
  provisioningScriptConfigPath: "${HOME}/.lsst/condor-info.py"

# By default, disable automatic provisioning of resources.
provisionResources: false
