DM-42579: Add call to allocateNodes inside bps #44
In HPC centers that do not have native HTCondor pools, LSST users must manually provision the resources needed to execute HTCondor jobs/workflows by running ``allocateNodes.py``, which creates glideins using the center's native batch system (e.g., Slurm). I added a class that automates this process by adding a special service node to HTCondor's DAGMan workflow. This node runs the script, ensuring that the glideins are created automatically, as needed, during workflow execution.
Added a mechanism that allows a service node specification to be added to and stored in the class representing an HTCondor DAG.
Added logic that allows the plugin to enable automatic resource provisioning during workflow execution.
Added a module with the default settings related to provisioning compute resources automatically during workflow execution.
The provisioning job will always run on the access point (aka submit node). As a result, there's no need to transfer any files. I modified the HTCondor submit description file accordingly.
Modified attributes of the provisioning job and HTCDag so the job shows up when ``bps report`` is run.
Attempting to cover different scenarios when handling the configuration file that may be required by the provisioning script made Provisioner.configure() overly complex. I decided to limit its scope so it handles just two basic use cases: (1) the configuration file does not exist and needs to be created from scratch based on the provided BPS configuration; (2) the configuration file already exists and will be used as is. If the user wishes to recreate the configuration file for the provisioning script based on new or updated settings, they will need to manually delete the existing one.
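The two use cases above can be sketched as follows. This is only an illustration, not the actual Provisioner.configure() code: the function name `ensure_provisioning_config`, the flat `key: value` file layout, and the `settings` dictionary are all hypothetical.

```python
from pathlib import Path


def ensure_provisioning_config(path: Path, settings: dict[str, str]) -> bool:
    """Create the provisioning config only if it does not exist yet.

    Hypothetical helper. Returns True if a new file was written and
    False if an existing file was kept untouched.
    """
    if path.exists():
        # Use case 2: an existing file is used as is, even if settings changed.
        return False
    # Use case 1: create the file from scratch based on the provided settings.
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}: {value}" for key, value in sorted(settings.items())]
    path.write_text("\n".join(lines) + "\n")
    return True
```

With this shape, picking up new settings means deleting the existing file first, which matches the behavior described in the commit message.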
Changed ``bps report`` behavior so it displays information related to resource provisioning only when provisioning is enabled in the BPS configuration.
Added a new section to the existing documentation that describes how to enable automatic provisioning of the resources.
``ctrl_execute`` has been part of **lsst_distrib** for some time. Removed the note that claimed otherwise.
The search options necessary to find settings for the provisioning job were defined outside of the Provisioner class. Made them part of the class, with an option to override them if necessary.
The methods of the Provisioner class need to be called in a certain order, but there were no safeguards to enforce it. Added them.
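One common way to enforce a call order is to track progress with flags and raise when a step is called too early. The sketch below assumes that approach; the class `OrderedProvisioner`, the `prepare()` and `provision()` method names, and the choice of `RuntimeError` are hypothetical and may differ from the actual Provisioner class.

```python
class OrderedProvisioner:
    """Hypothetical stand-in enforcing configure -> prepare -> provision."""

    def __init__(self) -> None:
        self.is_configured = False
        self.is_prepared = False

    def configure(self) -> None:
        # First step: nothing is required beforehand.
        self.is_configured = True

    def prepare(self) -> None:
        # Safeguard: refuse to run before configure().
        if not self.is_configured:
            raise RuntimeError("configure() must be called before prepare()")
        self.is_prepared = True

    def provision(self) -> str:
        # Safeguard: refuse to run before prepare().
        if not self.is_prepared:
            raise RuntimeError("prepare() must be called before provision()")
        return "provisioning job attached"
```

Calling `prepare()` on a fresh instance raises immediately, so a misordered call fails loudly instead of producing a half-initialized provisioning job.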
The wrong exception type was used in the safeguard that was supposed to catch any issues with writing the provisioning script config. Fixed it.
Fixed some typos, reworded selected comments, and renamed a variable to make code easier to read.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #44 +/- ##
==========================================
+ Coverage 57.41% 60.86% +3.44%
==========================================
Files 8 10 +2
Lines 2078 2279 +201
Branches 366 400 +34
==========================================
+ Hits 1193 1387 +194
- Misses 843 845 +2
- Partials 42 47 +5
There was no easy way to pass custom command-line options to the provisioning script. Added a new configuration option to address this limitation.
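A minimal sketch of how a user-supplied option string could be appended to the provisioning script's command line. The function name `build_provisioning_command` and the `--example-option` flag are hypothetical and are not taken from ``allocateNodes.py`` or from the plugin; only `shlex.split` is a real standard-library call.

```python
import shlex


def build_provisioning_command(script: str, extra_options: str = "") -> list[str]:
    """Return the argument list for the provisioning script (hypothetical helper)."""
    # shlex.split honors shell-style quoting, so an option value may contain spaces.
    return [script, *shlex.split(extra_options)]


# Example with a made-up option, for illustration only.
command = build_provisioning_command("allocateNodes.py", "--example-option 'a b'")
```

Keeping the command as a list rather than a single string avoids a second round of shell quoting when the command is eventually executed.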
654656b to a5f603d
a5f603d to 9f5f8a1
Some minor comments. Merge approved
When DAGMan deletes the service job on exit, the job's status is set to DELETED. To avoid confusing users, ``bps report`` previously showed the provisioning job status only while the workflow was running. As a result, there was no easy way to tell whether a run had used automatic provisioning of resources after it finished. Made changes to always report the provisioning job status; if the provisioning job was deleted by DAGMan, its final status is changed to SUCCEEDED and displayed as such.
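The status rewrite described above amounts to a small mapping step applied before reporting. The sketch below uses a hypothetical `JobState` enum and a hypothetical `deleted_by_dagman` flag as stand-ins for whatever state type and detection logic the plugin actually uses.

```python
from enum import Enum


class JobState(Enum):
    """Hypothetical stand-in for the job states used in reports."""

    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    DELETED = "DELETED"


def reported_provisioning_state(state: JobState, deleted_by_dagman: bool) -> JobState:
    """Return the state to display for the provisioning job."""
    # A service job removed by DAGMan on exit is expected behavior,
    # so it is reported as SUCCEEDED rather than DELETED.
    if state is JobState.DELETED and deleted_by_dagman:
        return JobState.SUCCEEDED
    # Every other state is reported unchanged.
    return state
```

A job deleted for any other reason keeps its DELETED status, so the report still distinguishes a normal shutdown from a user- or system-initiated removal.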
934e213 to 252429e
Fixed type descriptions in the _get_state_counts_from_jobs() docstring and renamed a variable to follow the naming convention used in other functions.
976e511 to 058711b
Checklist
doc/changes