Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote PBS constants to optional parameters #121

Conversation

SeanBryan51
Copy link
Collaborator

This change promotes PBS related constants to optional parameters in the configuration file so that PBS flags can be set at runtime.

This is useful in running multiple benchcab instances with different job parameters such as memory and the number of CPUs. This will also allow us to easily find an optimal number of CPUs to use to maximise performance.

This change also adds the ability to switch on and off multiprocessing at runtime via an optional parameter in the config file.

Fixes #104

@SeanBryan51 SeanBryan51 linked an issue Jul 28, 2023 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jul 28, 2023

Codecov Report

Merging #121 (309adca) into master (b5c2169) will decrease coverage by 0.19%.
Report is 1 commits behind head on master.
The diff coverage is 81.18%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
- Coverage   88.44%   88.26%   -0.19%     
==========================================
  Files          27       27              
  Lines        1385     1474      +89     
==========================================
+ Hits         1225     1301      +76     
- Misses        160      173      +13     
Files Changed Coverage Δ
benchcab/benchcab.py 33.91% <0.00%> (-2.80%) ⬇️
benchcab/comparison.py 61.36% <0.00%> (ø)
tests/common.py 88.57% <ø> (ø)
benchcab/task.py 83.00% <50.00%> (ø)
benchcab/bench_config.py 100.00% <100.00%> (ø)
benchcab/internal.py 90.24% <100.00%> (-0.24%) ⬇️
benchcab/utils/pbs.py 100.00% <100.00%> (ø)
tests/test_bench_config.py 100.00% <100.00%> (ø)
tests/test_pbs.py 100.00% <100.00%> (ø)

@SeanBryan51 SeanBryan51 force-pushed the 104-promote-pbs-related-constants-to-parameters-in-the-config-file branch from 08c04bb to 4d07515 Compare July 28, 2023 05:34
@SeanBryan51
Copy link
Collaborator Author

SeanBryan51 commented Jul 31, 2023

Integration tests

  1. Test multiprocessing set to False runs fluxsite tasks in serial:
#!/bin/bash
bench_example_dir='bench_example_test_multiprocessing_param'
rm -rf $bench_example_dir
git clone [email protected]:CABLE-LSM/bench_example.git $bench_example_dir
cd $bench_example_dir
git reset --hard 6287539e96fc8ef36dc578201fbf9847314147fb
sed -i 's/project:.*/project: tm70/g' config.yaml
sed -i 's/experiment:.*/experiment: AU-Tum/g' config.yaml
sed -i 's/ccc561\/v3.0-YP-changes/sb8430\/test-branch/g' config.yaml
echo "
fluxsite:
  multiprocess: False
" >> config.yaml
echo "
science_configurations:
  - cable:
      cable_user:
        CONSISTENCY_CHECK: False
" >> config.yaml
benchcab run -v

Job standard output:

Running fluxsite tasks...
Running task AU-Tum_2002-2017_OzFlux_Met_R0_S0... CABLE standard output saved in /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/tasks/AU-Tum_2002-2017_OzFlux_Met_R0_S0/out.txt
./cable cable.nml
Adding attributes to output file: /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/outputs/AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc
Running task AU-Tum_2002-2017_OzFlux_Met_R1_S0... CABLE standard output saved in /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/tasks/AU-Tum_2002-2017_OzFlux_Met_R1_S0/out.txt
./cable cable.nml
Adding attributes to output file: /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/outputs/AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc
Successfully ran fluxsite tasks

Running comparison tasks...
Comparing files AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc and AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc bitwise...
nccmp -df /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/outputs/AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc /scratch/tm70/sb8430/benchcab_integration_tests/bench_example_test_multiprocessing_param/runs/fluxsite/outputs/AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc
Success: files AU-Tum_2002-2017_OzFlux_Met_R0_S0_out.nc AU-Tum_2002-2017_OzFlux_Met_R1_S0_out.nc are identical
Successfully ran comparison tasks

======================================================================================
                  Resource Usage on 2023-08-07 20:45:40:
   Job Id:             92144995.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      1.79
   NCPUs Requested:    18                     NCPUs Used: 18
                                           CPU Time Used: 00:02:54
   Memory Requested:   30.0GB                Memory Used: 199.62MB
   Walltime requested: 06:00:00            Walltime Used: 00:02:59
   JobFS requested:    100.0MB                JobFS used: 0B
======================================================================================
  1. Test ncpus, mem and walltime set the PBS flags correctly in the job script:
#!/bin/bash
bench_example_dir='bench_example_test_pbs_params'
rm -rf $bench_example_dir
git clone [email protected]:CABLE-LSM/bench_example.git $bench_example_dir
cd $bench_example_dir
git reset --hard 6287539e96fc8ef36dc578201fbf9847314147fb
sed -i 's/project:.*/project: tm70/g' config.yaml
sed -i 's/ccc561\/v3.0-YP-changes/sb8430\/test-branch/g' config.yaml
echo "
fluxsite:
  pbs:
    ncpus: 20
    mem: 20GB
    walltime: 01:00:00
" >> config.yaml
benchcab run -v

Job summary:

======================================================================================
                  Resource Usage on 2023-07-31 12:58:25:
   Job Id:             91614874.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      4.12
   NCPUs Requested:    20                     NCPUs Used: 20
                                           CPU Time Used: 01:28:51
   Memory Requested:   20.0GB                Memory Used: 3.44GB
   Walltime requested: 01:00:00            Walltime Used: 00:06:11
   JobFS requested:    100.0MB                JobFS used: 0B
======================================================================================

Copy link
Collaborator

@ccarouge ccarouge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Just thinking we might need options to differentiate between fluxsite and spatial.

docs/user_guide/config_options.md Outdated Show resolved Hide resolved
docs/user_guide/config_options.md Outdated Show resolved Hide resolved
Comment on lines 46 to 47
# the "pbs" key is optional
if "pbs" in config:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will need 2 PBS keys: one for fluxsite and one for spatial as the resources are likely going to be quite different.

So should we rename "pbs" to "pbs_fluxsite" ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree. I'm thinking we go with this structure:

fluxsite:
  pbs:
    ncpus: 16
    ...
  multiprocessing: True
spatial:
  pbs:
    ncpus: 16
    ...

mainly because multiprocessing is probably specific to only fluxsite tests. My guess is we won't be running multiple cable MPI executables in parallel on a single PBS job like we do with the fluxsite case (e.g. if we use payu run to execute each spatial task, we would only be able to run one task per PBS job). Although it might be possible to run multiple MPI instances on a single PBS job (see here for example).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that looks good.

docs/user_guide/config_options.md Outdated Show resolved Hide resolved
docs/user_guide/config_options.md Outdated Show resolved Hide resolved
docs/user_guide/config_options.md Outdated Show resolved Hide resolved
Comment on lines 68 to 69
# the "multiprocessing" key is optional
if "multiprocessing" in config and not isinstance(config["multiprocessing"], bool):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, do we need 2: one for fluxsite and one for spatial? Not sure here, so we could leave as is for now.

Copy link
Collaborator Author

@SeanBryan51 SeanBryan51 Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per reply above

multiprocessing is probably specific to only fluxsite tests. My guess is we won't be running multiple cable MPI executables in parallel on a single PBS job like we do with the fluxsite case (e.g. if we use payu run to execute each spatial task, we would only be able to run one task per PBS job). Although it might be possible to run multiple MPI instances on a single PBS job (see here for example).

NCPUS = 18
MEM = "30GB"
WALL_TIME = "6:00:00"
DEFAULT_PBS: Any = {"ncpus": 18, "mem": "30GB", "walltime": "6:00:00", "storage": []}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need 2, one for fluxsite and one for spatial.

SeanBryan51 and others added 6 commits August 7, 2023 15:51
This change promotes PBS related constants to optional parameters in the
configuration file so that PBS flags can be set at runtime.

This is useful in running multiple benchcab instances with different job
parameters such as memory and the number of CPUs. This will also allow
us to easily find an optimal number of CPUs to use to maximise
performance.

This change also adds the ability to switch on and off multiprocessing
at runtime via an optional parameter in the config file.

Fixes #104
@SeanBryan51 SeanBryan51 force-pushed the 104-promote-pbs-related-constants-to-parameters-in-the-config-file branch 2 times, most recently from 06a94be to d5ba907 Compare August 7, 2023 06:34
@SeanBryan51 SeanBryan51 force-pushed the 104-promote-pbs-related-constants-to-parameters-in-the-config-file branch from d5ba907 to 03a724c Compare August 7, 2023 10:05
In config.yaml options, 'multiprocessing' should be 'multiprocess'.
Copy link
Collaborator

@ccarouge ccarouge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of suggestions for the documentation to specify fluxsite and pbs are optional.

docs/user_guide/config_options.md Outdated Show resolved Hide resolved
docs/user_guide/config_options.md Outdated Show resolved Hide resolved
@SeanBryan51 SeanBryan51 merged commit a35753f into master Aug 10, 2023
1 of 3 checks passed
@SeanBryan51 SeanBryan51 deleted the 104-promote-pbs-related-constants-to-parameters-in-the-config-file branch August 10, 2023 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Promote PBS related constants to parameters in the config file
2 participants