
Add configuration file for Graham (#169) #172

Open
bouthilx wants to merge 36 commits into SMART-Lab:master from bouthilx:graham_config
Conversation

@bouthilx (Collaborator) commented Oct 16, 2017

Wait for #166 before merging
Fix issue #169

aalitaiga and others added 30 commits July 31, 2017 15:36
Why:

For each option, add_sbatch_option would add the option in both forms,
--[OPTION_NAME] and [OPTION_NAME].
It will need many conversions, not only on resources, so it is better to make it
clean.
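As a rough illustration of the corrected behaviour (the helper's name comes from the commit message, but its signature and usage here are assumptions), each option would be emitted in a single `--name=value` form only:

```python
# Minimal sketch; the real add_sbatch_option in SmartDispatch may differ.
def add_sbatch_option(options, name, value=None):
    """Append a single sbatch option in its '--name[=value]' form only."""
    option = name if name.startswith("--") else "--{0}".format(name)
    if value is not None:
        option += "={0}".format(value)
    options.append(option)

opts = []
add_sbatch_option(opts, "gres", "gpu:2")
add_sbatch_option(opts, "export", "ALL")
print(opts)  # ['--gres=gpu:2', '--export=ALL']
```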
Slurm has no queues, so the PBS option -q is invalid and cannot be converted.
$PBS_JOBID was used to set the stdout/err paths of the job as well as in the
commands. Replace it with $SLURM_JOB_ID.

Also, workers were accessing os.environ[PBS_JOBID], so we added a second
fetch on SLURM_JOB_ID in case os.environ[PBS_JOBID] was undefined.
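A minimal sketch of that fallback, assuming the worker only needs the job id itself:

```python
import os

# Prefer PBS_JOBID; fall back to SLURM_JOB_ID when the job runs under Slurm.
job_id = os.environ.get("PBS_JOBID") or os.environ.get("SLURM_JOB_ID")
if job_id is None:
    raise RuntimeError("Neither PBS_JOBID nor SLURM_JOB_ID is defined.")
```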
Slurm cannot be passed environment variables defined locally on the
command line the way PBS_FILENAME is. To work around this, we add a definition
in the prolog, making PBS_FILENAME available to all commands and to the epilog.

NOTE: We also leave the PBS_FILENAME definition on the command line so that any
user using $PBS_FILENAME inside a custom pbsFlag can still do so.
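A hypothetical illustration of that prolog definition; the `pbs_filename` variable and the prolog list are assumptions, not SmartDispatch's actual internals:

```python
pbs_filename = "/path/to/generated/job.sbatch"  # placeholder path

# The export goes in the prolog, so every command and the epilog can read
# $PBS_FILENAME even though Slurm did not receive it on the command line.
prolog = ['export PBS_FILENAME="{0}"'.format(pbs_filename)]
```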
The PBS option -V is not converted properly to SBATCH --export=ALL.
We remove it and replace it with --export=ALL in the sbatch options.
Slurm does not automatically set an environment variable equivalent to
PBS_WALLTIME. To avoid confusion, all occurrences of PBS_WALLTIME are renamed
to SBATCH_TIMELIMIT (the environment variable one would use to set --time
with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to
the prolog to make it available to all commands and to the epilog.

NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings
at the time of building the PBS file. We needed to add a
walltime_to_seconds helper function to convert HH:MM:SS-like strings
into seconds, so that SBATCH_TIMELIMIT is set in seconds like
PBS_WALLTIME.
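A minimal sketch of what such a helper could look like; the actual implementation in SmartDispatch may differ (for instance in which formats it accepts):

```python
def walltime_to_seconds(walltime):
    """Convert an 'HH:MM:SS' or 'DD:HH:MM:SS' string into seconds."""
    parts = [int(p) for p in walltime.split(":")]
    if len(parts) == 3:
        days = 0
        hours, minutes, seconds = parts
    elif len(parts) == 4:
        days, hours, minutes, seconds = parts
    else:
        raise ValueError("Unsupported walltime format: {0}".format(walltime))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

assert walltime_to_seconds("02:30:00") == 9000
```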
It is possible to query the system to see if some commands are available
using distutils.spawn.find_executable(command_name).
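For example, a small helper built on find_executable could look like this (the launcher names checked here are illustrative assumptions):

```python
from distutils.spawn import find_executable

def detect_launcher(candidates=("sbatch", "msub", "qsub")):
    """Return the first launcher found on the system, or None."""
    for launcher in candidates:
        if find_executable(launcher) is not None:
            return launcher
    return None
```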

Clusters where more than one launcher is available will still get their
launcher selected based on string matching. For instance,
get_launcher("helios") would always return msub no matter what is
available on the system.
It is difficult to debug resuming while important processes are taking place in
the PBS script automatically built by SmartDispatch.

We add a verbose option to the smart-dispatch script and add debugging prints in the epilog.
JobGenerators are selected by job_generator_factory based on the
cluster's name. We use a more flexible, duck-typing approach for Slurm
clusters. If the cluster name is unknown, or does not match any of the if-clauses
in the factory, then we look at which launcher is available on
the system. If it is sbatch, a SlurmJobGenerator is built;
otherwise a plain JobGenerator is built.
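A rough sketch of that fallback, assuming a simple factory structure (the known-cluster mapping and class bodies below are placeholders, not SmartDispatch's exact API):

```python
from distutils.spawn import find_executable

class JobGenerator(object):
    pass

class SlurmJobGenerator(JobGenerator):
    pass

KNOWN_CLUSTERS = {}  # e.g. cluster name -> specialized generator class

def job_generator_factory(cluster_name, *args, **kwargs):
    if cluster_name in KNOWN_CLUSTERS:            # explicit if-clauses stay first
        return KNOWN_CLUSTERS[cluster_name](*args, **kwargs)
    if find_executable("sbatch") is not None:     # unknown cluster, sbatch present
        return SlurmJobGenerator(*args, **kwargs)
    return JobGenerator(*args, **kwargs)
```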
The command `sacctmgr` fails on some computers (namely mila01), but the
current behavior gives the impression that sbatch is simply not available.

Printing the stderr makes it more obvious that sbatch should be
available, but that something is broken behind sacctmgr. It only appears when
using the -vv option, nevertheless.
Adding a script to run automatic verifications asserting the validity of the
current code.

The verifications are not automatic unit tests; they automatically check
that the process executed successfully, but the administrator
still needs to verify manually, by reading the logs, that the requested
resources were provided.

Verifications can easily be combined, complex ones building on top of
simpler ones (see the sketch after the list below).

Here is a list of all the verifications currently implemented for Slurm
clusters:

 1. very_simple_task                              (1  CPU)
 2. verify_simple_task_with_one_gpu               (1  CPU 1 GPU)
 3. verify_simple_task_with_many_gpus             (1  CPU X GPU)
 4. verify_many_task                              (X  CPU)
 5. verify_many_task_with_many_cores              (XY CPU)
 6. verify_many_task_with_one_gpu                 (X CPU X GPU)
 7. verify_many_task_with_many_gpus               (X CPU Y GPU)
 8. verify_simple_task_with_autoresume_unneeded   (1 CPU)
 9. verify_simple_task_with_autoresume_needed     (1 CPU)
10. verify_many_task_with_autoresume_needed       (X CPU)
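Purely as an illustration of how a more complex verification can build on a simpler one; the helpers `submit_test_job` and `wait_for_success` are placeholders, not code from this PR:

```python
def submit_test_job(resources):
    return "12345"          # placeholder: submit a tiny job, return its id

def wait_for_success(job_id):
    return True             # placeholder: did the job exit cleanly?

def verify_simple_task(resources="1 CPU"):
    job_id = submit_test_job(resources)
    assert wait_for_success(job_id)   # automatic part: the process executed successfully
    print("Manually check the logs of job {0}: were {1} provided?".format(job_id, resources))

def verify_simple_task_with_one_gpu():
    # A more complex verification built on top of a simpler one.
    verify_simple_task(resources="1 CPU 1 GPU")
```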
bouthilx and others added 4 commits October 16, 2017 10:27
My initial thought was that get_launcher should raise an error when no
launcher is found on the system, since no job could be launched.
I realized that this would break the --doNotLaunch option that users may
want to use on a system with no launcher, just to create the files.
The tests were failing because the account was not specified.
The tests were failing because the account was not specified.
@coveralls

Coverage Status

Coverage increased (+1.2%) to 95.725% when pulling 090309b on bouthilx:graham_config into 9133e15 on SMART-Lab:master.

@nouiz (Collaborator) commented Oct 17, 2017

Does this completely fix #169?

@bouthilx (Collaborator, Author)
@nouiz It's a branch forked from adrien_slurm, the one PR #166 is based upon. Yes, it should. The question is more: are those configurations the ones we should define or not? We don't have strict queues like Torque clusters, so it's up to us to decide how we define/divide queues.

There was a missing parenthesis which was causing a bad conversion of
"DD:HH:MM:SS" to seconds.

The unit test was also missing the same parenthesis. I added a unit test
to make sure such an error cannot occur again.
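A hypothetical illustration of this kind of parenthesization bug (the actual expression in SmartDispatch may differ):

```python
days, hours, minutes, seconds = 1, 2, 0, 0   # "01:02:00:00"

buggy   = days * 24 + hours * 3600 + minutes * 60 + seconds    # 7224: '*' binds before '+'
correct = (days * 24 + hours) * 3600 + minutes * 60 + seconds  # 93600

assert correct == 26 * 3600
```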
@coveralls

Coverage Status

Coverage increased (+1.2%) to 95.738% when pulling cac2f08 on bouthilx:graham_config into 9133e15 on SMART-Lab:master.

@bouthilx changed the title from "Add configuration file for Graham" to "Add configuration file for Graham (#169)" on Oct 18, 2017
### SLURM clusters

Smartdispatch can also run on SLURM clusters.
All features like `--gpusPerNode` or `--coresPerNode` are supported.
A Member commented on this change:
All features including the ones like ?
