add support for torque in IPMU #36

Open
wants to merge 4 commits into base: main

Conversation

mr-superonion

The PR adds support for Torque and makes sure it can be run on the servers at IPMU (for the HSC project).

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

return "torque"


class TorqueProviderI(TorqueProvider):

benclifford

I'm interested in what this subclass is for - it looks like you're trying to add a tasks-per-node parameter which would usually end up launching multiple copies of the Parsl worker pool on one node (rather than having one process worker pool manage the whole node). Is this what you're intending / is this actually what happens?

mr-superonion (Author)

I am pasting the submission script generated by parsl:

#!/bin/bash

#PBS -N shear.test
#PBS -q small

#PBS -S /bin/bash
#PBS -N parsl.parsl.torque.block-0.1726231023.145446
#PBS -m n
#PBS -l walltime=10:00:00
#PBS -l nodes=2:ppn=12
#PBS -o /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stdout
#PBS -e /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stderr

source /work/xiangchong.li/setupIm.sh

export JOBNAME="parsl.parsl.torque.block-0.1726231023.145446"

set -e
export CORES=$(getconf _NPROCESSORS_ONLN)
[[ "1" == "1" ]] && echo "Found cores : $CORES"
WORKERCOUNT=24

cat << MPIRUN_EOF > cmd_$JOBNAME.sh
process_worker_pool.py   -a gw2.local -p 0 -c 1.0 -m None --poll 10 --task_port=54319 --result_port=54758 --logdir=/work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/torque --block_id=0 --hb_period=30  --hb_threshold=120 --cpu-affinity none --available-accelerators  --start-method spawn
MPIRUN_EOF
chmod u+x cmd_$JOBNAME.sh

mpirun -np $WORKERCOUNT  /bin/bash cmd_$JOBNAME.sh

[[ "1" == "1" ]] && echo "All workers done"

mr-superonion (Author)

I added this subclass so that I can change the ppn parameter in the PBS system by setting tasks_per_node in the configuration file. The goal is to use 12 CPUs on each node, with each CPU running one task.

I am not sure this is the best way to do it, but the code runs on the server.

benclifford

In the usual Parsl model, you'd run one copy of process_worker_pool.py on each node, and that worker pool would be in charge of running multiple tasks at once. The command line you specify has an option -c 1.0 which means 1 core per worker.

So the worker pool code should run as many workers (and so, as many simultaneous tasks) as you have cores on your worker node: that is the code that is in charge of running multiple workers, not mpirun.

Have a look in your run directory (deep inside runinfo/....) for a file called manager.log. You should see one per node (or with your configuration above, 24 per node) and inside those files you should see a log line like this:

2024-09-13 12:54:44.837 parsl:254 72 MainThread [INFO]  Manager will spawn 8 workers

How many workers do you see there?
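(For reference, a simplified sketch of the sizing arithmetic behind that manager.log line; this is illustrative only, the real sizing logic lives in Parsl's process_worker_pool.py.)

import math
import os

# What the submit script's "getconf _NPROCESSORS_ONLN" reports on the node.
cores_on_node = os.cpu_count()

# The "-c 1.0" option on the process_worker_pool.py command line above.
cores_per_worker = 1.0

# With one pool per node, the pool would spawn roughly this many workers.
workers = math.floor(cores_on_node / cores_per_worker)
print(f"Manager will spawn {workers} workers")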

mr-superonion (Author), Sep 13, 2024

> You should see one per node (or with your configuration above, 24 per node)

I think that in the submission script generated by parsl, I run 12 tasks per node, each with one CPU, so there are 24 tasks over 2 nodes.
#PBS -l nodes=2:ppn=12 says that each node uses 12 cores (therefore 12 tasks running at the same time),
while WORKERCOUNT=24 says that there are 24 workers across all the nodes.

mr-superonion (Author), Sep 13, 2024

Note that slurm handles this in a more consistent way, I guess:
https://github.com/Parsl/parsl/blob/dd9150d7ac26b04eb8ff15247b1c18ce9893f79c/parsl/providers/slurm/slurm.py#L266

It has the option to set cores_per_task in addition to tasks_per_node. PBS does not have this option.

mr-superonion (Author)

> In your setup, I think you should make each worker pool try to use only 1 core, so that when you run 12 worker pools per node, you get 1 x 12 = 12 workers on each node. Have a look at the max_workers Parsl configuration parameter - for example, see how it is configured at in2p3:

Yeah.. Got it now. Thanks
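(A hedged sketch of the suggestion quoted above, with illustrative values rather than the actual in2p3 configuration: with 12 worker pools launched on each node, cap each pool at a single worker so that a node runs 1 x 12 = 12 workers in total.)

from parsl.executors import HighThroughputExecutor

executor = HighThroughputExecutor(
    label="torque",
    max_workers=1,  # one worker per pool; 12 pools per node -> 12 workers per node
    # provider and launcher stay as configured elsewhere in this PR
)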

benclifford

There are lots of different ways to change things to get what you want, so it is quite confusing.

You could try this:

i) set the number of nodes in your job to 1 (so if you want to run on multiple nodes, you launch multiple blocks/multiple batch jobs)

ii) use the change you have made in this PR to set tasks_per_node to 12 - so that 12 cores are requested in #PBS -l nodes=...

iii) use the SimpleLauncher instead of the MpiRunLauncher here:

https://github.com/lsst/ctrl_bps_parsl/pull/36/files#diff-e5ba88552b57b323bd184a741f622b7cc7b3a4090d5ac09456f7a8fe85fcc75cR287

so that only a single copy of the process worker pool is launched in each batch job - rather than using mpirun to launch many copies of it

iv) tell the process worker pool to use 12 workers per pool, using max_workers = 12.

That should result in batch jobs (a config sketch follows the list below) where each batch job:

  • gets 12 cores from PBS
  • runs one copy of the Parsl process worker pool
  • the process worker pool runs 12 workers
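(A hedged sketch of steps i-iv as a Parsl configuration. Values are illustrative; TorqueProviderI stands for the provider subclass added in this PR, and its import is omitted because the module path depends on where the PR places it.)

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher

config = Config(executors=[
    HighThroughputExecutor(
        label="torque",
        max_workers=12,                 # (iv) 12 workers inside the single pool
        provider=TorqueProviderI(       # subclass from this PR; import not shown
            queue="small",
            nodes_per_block=1,          # (i) one node per block / batch job
            tasks_per_node=12,          # (ii) requests ppn=12 in "#PBS -l nodes=..."
            walltime="10:00:00",
            launcher=SimpleLauncher(),  # (iii) one worker pool per batch job
        ),
    ),
])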

mr-superonion (Author)

Thanks. Does the SimpleLauncher support running on two nodes? I thought that if I use two nodes, I have to have two copies, one on each node, and I thought that copying should be done with the MpiRunLauncher? Please correct me if this understanding is wrong.

benclifford

SimpleLauncher does not support running on two nodes.

The model I wrote above has 1 node per block / per batch job - and if you want to use two nodes, set the max_blocks parameter to 2, so that you get two separate batch jobs that each look like this.

(I opened a Parsl issue Parsl/parsl#3616 to request that the Parsl team try to make this interface nicer, some time in the future)
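(A hedged sketch of that variant, with illustrative values and TorqueProviderI as above, import not shown: keep one node per block and allow the provider two blocks, so that two single-node batch jobs are submitted.)

from parsl.launchers import SimpleLauncher

provider = TorqueProviderI(      # subclass from this PR; import not shown
    queue="small",
    nodes_per_block=1,           # one node per batch job
    tasks_per_node=12,
    init_blocks=2,
    max_blocks=2,                # up to two batch jobs -> up to two nodes in use
    walltime="10:00:00",
    launcher=SimpleLauncher(),
)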

mr-superonion (Author) commented Sep 14, 2024

Is there anything else I need to do to finish the PR?

benclifford

> Is there anything else I need to do to finish the PR?

@mr-superonion that's probably not for me to say - I was mostly interested in understanding what is missing in Parsl to make this so complicated, and I think I have got that information now in Parsl/parsl#3616 and Parsl/parsl#3617

PaulPrice (Contributor)

Hi @mr-superonion. I've seen this, am grateful for your contribution, and will work on getting it incorporated. There are some hoops that I've got to jump through.
Thanks also to @benclifford for his expert help with Parsl.

benclifford added a commit to Parsl/parslguts that referenced this pull request Sep 17, 2024
mr-superonion (Author) commented Oct 3, 2024

Sorry, the earlier code actually does not work: the system put all the pools on one node and the other nodes basically did nothing. I have now changed the code so that it creates num_nodes * tasks_per_node pools, each with one worker... It seems this is the only way to make it work for multiple nodes in torque?

@benclifford, please let me know if this setup does not make sense.
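(A hedged sketch of the arrangement described above, with illustrative values; TorqueProviderI as before, import not shown, and the launcher choice is an assumption: the batch job requests nodes=2:ppn=12, the launcher starts num_nodes * tasks_per_node = 24 worker-pool copies spread across the nodes, and each pool is capped at one worker.)

from parsl.executors import HighThroughputExecutor
from parsl.launchers import MpiRunLauncher

executor = HighThroughputExecutor(
    label="torque",
    max_workers=1,                  # one worker per pool -> 24 workers over 2 nodes
    provider=TorqueProviderI(       # subclass from this PR; import not shown
        queue="small",
        nodes_per_block=2,
        tasks_per_node=12,
        walltime="10:00:00",
        launcher=MpiRunLauncher(),  # assumed: mpirun -np 24 distributes the pool copies
    ),
)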
