add support for torque in IPMU #36

Open
wants to merge 4 commits into base: main

Conversation

mr-superonion

The PR adds support for Torque and makes sure it can be run on the servers at IPMU (for the HSC project).

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

return "torque"


class TorqueProviderI(TorqueProvider):

benclifford

I'm interested in what this subclass is for - it looks like you're trying to add a tasks-per-node parameter which would usually end up launching multiple copies of the Parsl worker pool on one node (rather than having one process worker pool manage the whole node). Is this what you're intending / is this actually what happens?

mr-superonion (Author)

I am pasting the submission script generated by parsl:

#!/bin/bash

#PBS -N shear.test
#PBS -q small

#PBS -S /bin/bash
#PBS -N parsl.parsl.torque.block-0.1726231023.145446
#PBS -m n
#PBS -l walltime=10:00:00
#PBS -l nodes=2:ppn=12
#PBS -o /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stdout
#PBS -e /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stderr

source /work/xiangchong.li/setupIm.sh

export JOBNAME="parsl.parsl.torque.block-0.1726231023.145446"

set -e
export CORES=$(getconf _NPROCESSORS_ONLN)
[[ "1" == "1" ]] && echo "Found cores : $CORES"
WORKERCOUNT=24

cat << MPIRUN_EOF > cmd_$JOBNAME.sh
process_worker_pool.py   -a gw2.local -p 0 -c 1.0 -m None --poll 10 --task_port=54319 --result_port=54758 --logdir=/work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/torque --block_id=0 --hb_period=30  --hb_threshold=120 --cpu-affinity none --available-accelerators  --start-method spawn
MPIRUN_EOF
chmod u+x cmd_$JOBNAME.sh

mpirun -np $WORKERCOUNT  /bin/bash cmd_$JOBNAME.sh

[[ "1" == "1" ]] && echo "All workers done"

mr-superonion (Author)

I added this subclass so that I can change the ppn parameter in the PBS system by setting tasks_per_node in the configuration file. The goal is to use 12 CPUs on each node, with each CPU running one task.

I am not sure this is the best way to do it, but the code runs on the server.

benclifford

In the usual Parsl model, you'd run one copy of process_worker_pool.py on each node, and that worker pool would be in charge of running multiple tasks at once. The command line you specify has an option -c 1.0 which means 1 core per worker.

So the worker pool code should run as many workers (and so, as many simultaneous tasks) as you have cores on your worker node: that is the code that is in charge of running multiple workers, not mpirun.

Have a look in your run directory (deep inside runinfo/....) for a file called manager.log. You should see one per node (or with your configuration above, 24 per node) and inside those files you should see a log line like this:

2024-09-13 12:54:44.837 parsl:254 72 MainThread [INFO]  Manager will spawn 8 workers

How many workers do you see there?
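(For reference, a simplified sketch of the sizing arithmetic behind that manager.log line; this is illustrative only, the real sizing logic lives in Parsl's process_worker_pool.py.)

import math
import os

# What the submit script's "getconf _NPROCESSORS_ONLN" reports on the node.
cores_on_node = os.cpu_count()

# The "-c 1.0" option on the process_worker_pool.py command line above.
cores_per_worker = 1.0

# With one pool per node, the pool would spawn roughly this many workers.
workers = math.floor(cores_on_node / cores_per_worker)
print(f"Manager will spawn {workers} workers")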

mr-superonion (Author), Sep 13, 2024

> You should see one per node (or with your configuration above, 24 per node)

I think that in the submission script generated by parsl, I run 12 tasks per node, each with one CPU, so there are 24 tasks over 2 nodes.
#PBS -l nodes=2:ppn=12 says that each node uses 12 cores (therefore 12 tasks running at the same time),
while WORKERCOUNT=24 says that there are 24 workers across all the nodes.

mr-superonion (Author), Sep 13, 2024

Note that slurm handles this in a more consistent way, I guess:
https://github.com/Parsl/parsl/blob/dd9150d7ac26b04eb8ff15247b1c18ce9893f79c/parsl/providers/slurm/slurm.py#L266

It has the option to set cores_per_task in addition to tasks_per_node. PBS does not have this option.

mr-superonion (Author)

> In your setup, I think you should make each worker pool try to use only 1 core, so that when you run 12 worker pools per node, you get 1 x 12 = 12 workers on each node. Have a look at the max_workers Parsl configuration parameter - for example, see how it is configured at in2p3:

Yeah.. Got it now. Thanks
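(A hedged sketch of the suggestion quoted above, with illustrative values rather than the actual in2p3 configuration: with 12 worker pools launched on each node, cap each pool at a single worker so that a node runs 1 x 12 = 12 workers in total.)

from parsl.executors import HighThroughputExecutor

executor = HighThroughputExecutor(
    label="torque",
    max_workers=1,  # one worker per pool; 12 pools per node -> 12 workers per node
    # provider and launcher stay as configured elsewhere in this PR
)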

benclifford

There are lots of different ways to change things to get what you want, so it is quite confusing.

You could try this:

i) set the number of nodes in your job to 1 (so if you want to run on multiple nodes, you launch multiple blocks/multiple batch jobs)

ii) use the change you have made in this PR to set tasks_per_node to 12 - so that 12 cores are requested in #PBS -l nodes=...

iii) use the SimpleLauncher instead of the MpiRunLauncher here:

https://github.com/lsst/ctrl_bps_parsl/pull/36/files#diff-e5ba88552b57b323bd184a741f622b7cc7b3a4090d5ac09456f7a8fe85fcc75cR287

so that only a single copy of the process worker pool is launched in each batch job - rather than using mpirun to launch many copies of it

iv) tell the process worker pool to use 12 workers per pool, using max_workers = 12.

That should result in batch jobs (a config sketch follows the list below) where each batch job:

  • gets 12 cores from PBS
  • runs one copy of the Parsl process worker pool
  • the process worker pool runs 12 workers
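(A hedged sketch of steps i-iv as a Parsl configuration. Values are illustrative; TorqueProviderI stands for the provider subclass added in this PR, and its import is omitted because the module path depends on where the PR places it.)

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher

config = Config(executors=[
    HighThroughputExecutor(
        label="torque",
        max_workers=12,                 # (iv) 12 workers inside the single pool
        provider=TorqueProviderI(       # subclass from this PR; import not shown
            queue="small",
            nodes_per_block=1,          # (i) one node per block / batch job
            tasks_per_node=12,          # (ii) requests ppn=12 in "#PBS -l nodes=..."
            walltime="10:00:00",
            launcher=SimpleLauncher(),  # (iii) one worker pool per batch job
        ),
    ),
])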

mr-superonion (Author)

Thanks. Does the SimpleLauncher support running on two nodes? I thought that if I use two nodes, I have to have two copies, one on each node, and I thought that copying should be done with the MpiRunLauncher? Please correct me if this understanding is wrong.

benclifford

SimpleLauncher does not support running on two nodes.

The model I wrote above has 1 node per block / per batch job - and if you want to use two nodes, set the max_blocks parameter to 2, so that you get two separate batch jobs that each look like this.

(I opened a Parsl issue Parsl/parsl#3616 to request that the Parsl team try to make this interface nicer, some time in the future)
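(A hedged sketch of that variant, with illustrative values and TorqueProviderI as above, import not shown: keep one node per block and allow the provider two blocks, so that two single-node batch jobs are submitted.)

from parsl.launchers import SimpleLauncher

provider = TorqueProviderI(      # subclass from this PR; import not shown
    queue="small",
    nodes_per_block=1,           # one node per batch job
    tasks_per_node=12,
    init_blocks=2,
    max_blocks=2,                # up to two batch jobs -> up to two nodes in use
    walltime="10:00:00",
    launcher=SimpleLauncher(),
)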

mr-superonion (Author) commented Sep 14, 2024

Is there anything else I need to do to finish the PR?

benclifford

> Is there anything else I need to do to finish the PR?

@mr-superonion that's probably not for me to say - I was mostly interested in understanding what is missing in Parsl to make this so complicated, and I think I have got that information now in Parsl/parsl#3616 and Parsl/parsl#3617

PaulPrice (Contributor)

Hi @mr-superonion. I've seen this, am grateful for your contribution, and will work on getting it incorporated. There are some hoops that I've got to jump through.
Thanks also to @benclifford for his expert help with Parsl.

benclifford added a commit to Parsl/parslguts that referenced this pull request Sep 17, 2024
mr-superonion (Author) commented Oct 3, 2024

Sorry, the earlier code actually does not work: the system put all the pools on one node and the other nodes basically did nothing. I have now changed the code so that it creates num_nodes * tasks_per_node pools, each with one worker... It seems this is the only way to make it work for multiple nodes in torque?

@benclifford, please let me know if this setup does not make sense.
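(A hedged sketch of the arrangement described above, with illustrative values; TorqueProviderI as before, import not shown, and the launcher choice is an assumption: the batch job requests nodes=2:ppn=12, the launcher starts num_nodes * tasks_per_node = 24 worker-pool copies spread across the nodes, and each pool is capped at one worker.)

from parsl.executors import HighThroughputExecutor
from parsl.launchers import MpiRunLauncher

executor = HighThroughputExecutor(
    label="torque",
    max_workers=1,                  # one worker per pool -> 24 workers over 2 nodes
    provider=TorqueProviderI(       # subclass from this PR; import not shown
        queue="small",
        nodes_per_block=2,
        tasks_per_node=12,
        walltime="10:00:00",
        launcher=MpiRunLauncher(),  # assumed: mpirun -np 24 distributes the pool copies
    ),
)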
