Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-42579: Add call to allocateNodes inside bps #44

Merged
merged 43 commits into from
Sep 25, 2024
Merged

Conversation

mxk62
Copy link
Collaborator

@mxk62 mxk62 commented Sep 17, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

In HPC centers that do not have HTCondor pools natively, the necessary
resources to execute HTCondor jobs/workflows must be provisioned
manually by the LSST users using ``allocateNodes.py`` that creates
glideins using center's native batch system like Slurm.  I added the
class that should automate this process by adding a special service node
to the HTCondor's DAGMan workflow that will be running the script
ensuring that these glideins will be created as needed automatically
during the workflow execution.
Added a mechanism allowing for storing and adding a service node
specification to the class representing an HTCondor DAG.
Added a logic which will allow the plugin to enable automatic resource
provisioning during workflow execution.
Added a module with the default settings related to provisioning
compute resources automatically during a workflow execution.
Provisioning job will always run on the access point (aka submit node).
As a result, there's no need for transferring any files. I modified the
HTCondor submit description file accordingly.
Modified attributes of the provisioning job and HTCDag so it shows up
when 'bps report' is run.
Attempting to cover different scenarios when handling the configuration
file that may be required by the provisioning script made
Provisioner.configure() overly complex. I decided to limit its scope so
it just handles two basic use cases:

1. The configuration file does not exists and needs to be created from
   scratch based on the provided BPS configuration.

2. The configuration file already exists and will be used as is.

If the user wishes to recreate the configuration file for the
provisioning script based on new/updated settings, they will need to
manually delete the existing one.
Changed ``bps report`` behavior so it will display the information
related to the resource provisioning only when it was enabled in the BPS
configuration.
Added a new section to the existing documenatation that describes
how to enable automatic provisioning of the resources.
``ctrl_execute`` is a part of the **lsst_distrib** from some time.
Removed the note which claimed otherwise.
The search options necessary to find settings for the provisioning job
were defined outside of the Provisioner class.  Made them part of the
class with on option to override them in necessary.
The methods of the Provisioner class need to be called in certain order,
but there were no safeguards to ensure it. Added them.
A wrong exception type was used in the safeguard that was supposed to
catch any issues with writing the provisioning script config.  Fixed it.
Fixed some typos, reworded selected comments, and renamed a variable
to make code easier to read.
Copy link

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 74.90909% with 69 lines in your changes missing coverage. Please review.

Project coverage is 60.86%. Comparing base (fe78af9) to head (058711b).

Files with missing lines Patch % Lines
python/lsst/ctrl/bps/htcondor/htcondor_service.py 28.26% 32 Missing and 1 partial ⚠️
python/lsst/ctrl/bps/htcondor/lssthtc.py 51.51% 26 Missing and 6 partials ⚠️
python/lsst/ctrl/bps/htcondor/provisioner.py 96.61% 1 Missing and 1 partial ⚠️
tests/test_provisioner.py 98.07% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   57.41%   60.86%   +3.44%     
==========================================
  Files           8       10       +2     
  Lines        2078     2279     +201     
  Branches      366      400      +34     
==========================================
+ Hits         1193     1387     +194     
- Misses        843      845       +2     
- Partials       42       47       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

There was not an easy way to include custom command line options when
calling the provisioning script. Added a new configuration option
to address this limitation.
Copy link
Contributor

@MichelleGower MichelleGower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Merge approved

doc/lsst.ctrl.bps.htcondor/userguide.rst Outdated Show resolved Hide resolved
doc/lsst.ctrl.bps.htcondor/userguide.rst Outdated Show resolved Hide resolved
doc/lsst.ctrl.bps.htcondor/userguide.rst Outdated Show resolved Hide resolved
doc/lsst.ctrl.bps.htcondor/userguide.rst Outdated Show resolved Hide resolved
doc/lsst.ctrl.bps.htcondor/userguide.rst Show resolved Hide resolved
python/lsst/ctrl/bps/htcondor/provisioner.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/htcondor/provisioner.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/htcondor/provisioner.py Outdated Show resolved Hide resolved
doc/lsst.ctrl.bps.htcondor/userguide.rst Show resolved Hide resolved
python/lsst/ctrl/bps/htcondor/htcondor_service.py Outdated Show resolved Hide resolved
When DAGMan deletes the service job on its exits this job's status is
set to DELETED.  For this reason to avoid confusing users the
provisioning job status was shown by ``bps report `` only while the
workflow was running.  As a result there is no way to easily know if the
run was using automatic provisioning of the resources after it is
finished.  Made changes to always report the provisioning job status,
but if the provisioning job was deleted by DAGMan its final status is
changed to SUCCEEDED and displayed as such.
@mxk62 mxk62 force-pushed the tickets/DM-42579 branch 2 times, most recently from 934e213 to 252429e Compare September 25, 2024 00:21
Fixed a type descriptions in _get_state_counts_from_jobs() docstring and
rename variable to follow the naming convention used in other functions.
@mxk62 mxk62 force-pushed the tickets/DM-42579 branch 2 times, most recently from 976e511 to 058711b Compare September 25, 2024 15:14
@mxk62 mxk62 merged commit c48eab7 into main Sep 25, 2024
13 checks passed
@mxk62 mxk62 deleted the tickets/DM-42579 branch September 25, 2024 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants