Support for node resource hints in provider #942

Open
yadudoc opened this issue May 14, 2019 · 14 comments

@yadudoc
Member

yadudoc commented May 14, 2019

With IPP we had to explicitly specify the number of workers to launch per node, since the launch system doesn't necessarily run on the compute node and lacked information about the available resources beforehand. As we developed HTEX, we moved to an architecture where the launch system starts one manager per node, and this manager, after probing the available CPU/memory resources on the node, launches workers based on per-worker resource limits set by the user. This only helps once a node has been provisioned; it doesn't help in the strategy-planning phase, where resource information is still unavailable and we assume the worst case (only one worker can run per node), leading to a gross overestimation of the resources required. One potential automatic method would be to launch a single block and wait until resource information is available before attempting to scale further, but this might work poorly with slow queues. That would take more effort, and is better done alongside the planned revamp of the strategy component.
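To make the overestimation concrete, here is a small illustration of what the worst-case assumption costs (a sketch with made-up numbers, not measurements):

```
# Hypothetical example: 64 single-core tasks on 64-core nodes.
tasks = 64
cores_per_node = 64            # unknown to the strategy component ahead of time
assumed_workers_per_node = 1   # worst-case assumption
actual_workers_per_node = 64   # what fits once a manager probes the node

print(tasks // assumed_workers_per_node)  # 64 nodes requested by the strategy
print(tasks // actual_workers_per_node)   # 1 node actually needed
```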

When all else fails, we could fall back to having the user specify resource hints to the provider:

```
HighThroughputExecutor(
    label="htex",
    worker_debug=True,
    cores_per_worker=2,
    mem_per_worker=2,         # 2 GB
    provider=SlurmProvider(
        'debug',
        nodes_per_block=2,
        cpus_per_node=64,     # <-- New
        mem_per_node=128,     # <-- New
    )
)
```

This is a MolSSI/QCArchive requirement (@Lnaden)

@annawoodard
Collaborator

Note that we already have cpus_per_node in the PBSProProvider. Also note that we probably want to keep #943 in mind as we implement this, because they are closely related.

@yadudoc
Member Author

yadudoc commented Jul 18, 2019

This is high priority for QCArchive.

@yadudoc
Member Author

yadudoc commented Jul 23, 2019

The quickest way to go about this would be to specify exactly the resource slice you need from the scheduler via Parsl's provider.scheduler_options kwarg. This would be combined with the provider.worker_init kwarg, which would export a few environment variables that Parsl can use. We need to add support for using these exported options in the Parsl worker code; I'll add a separate issue for that.

Here's a sample config that would work once we've got those bits implemented:

```
HighThroughputExecutor(
    label="htex",
    cores_per_worker=2,
    mem_per_worker=2,        # 2 GB
    provider=SlurmProvider(
        'debug',
        nodes_per_block=2,
        scheduler_options='#SBATCH --cpus-per-task=2 --mem-per-cpu=1g --ntasks=1',
        worker_init='export PARSL_MAX_MEMORY=2G; export PARSL_MAX_CPUS=2',
    )
)
```
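On the worker side, the exported values could be picked up roughly like this (a minimal sketch; PARSL_MAX_MEMORY and PARSL_MAX_CPUS are the hypothetical variable names from the config above, not variables Parsl currently reads):

```
import os
import multiprocessing

# Hypothetical sketch: fall back to the whole node when the hints are absent.
max_cpus = int(os.environ.get("PARSL_MAX_CPUS", multiprocessing.cpu_count()))

# The memory hint is a string like "2G" in the config above; real code would
# need to parse the unit suffix before using it to cap worker memory.
max_memory = os.environ.get("PARSL_MAX_MEMORY")  # None means "use the full node"
```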

@Lnaden, @dgasmith Could you take a look, please?

@dgasmith
Contributor

It would be good if the scheduler options were taken care of automatically based on the general cores_per_worker. Without that, the two can get out of sync and wind up causing weird errors. dask-jobqueue handles it this way; I can dig up the templates if you want.
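For illustration, the scheduler directives could be derived from the executor's per-worker settings along these lines (a hypothetical sketch; build_sbatch_options is not a Parsl or dask-jobqueue function, and a real implementation would need to handle units, rounding, and scheduler differences):

```
def build_sbatch_options(cores_per_worker, mem_per_worker_gb, workers_per_node):
    """Compose #SBATCH directives from per-worker resource settings.

    Hypothetical helper for illustration only.
    """
    cpus = cores_per_worker * workers_per_node
    mem_gb = mem_per_worker_gb * workers_per_node
    return "#SBATCH --cpus-per-task={} --mem={}G --ntasks=1".format(cpus, mem_gb)

# Two 2-core, 2 GB workers per node:
print(build_sbatch_options(cores_per_worker=2, mem_per_worker_gb=2, workers_per_node=2))
# -> #SBATCH --cpus-per-task=4 --mem=4G --ntasks=1
```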

annawoodard assigned annawoodard and yadudoc and unassigned yadudoc Jul 30, 2019
annawoodard added a commit that referenced this issue Aug 22, 2019
This commit adds support for respecting the specification of how much
CPU and memory should be used by workers via the PARSL_CORES and
PARSL_GB_MEMORY environment variables (as opposed to inferring it to be
the full node once a job has started on the node). After this commit,
the config below should work to submit two two-core workers.

This is a pre-requisite to a full solution to #942.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=2,
            mem_per_worker=2,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                cmd_timeout=60,
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                scheduler_options='#SBATCH --cpus-per-task=2 --mem-per-cpu=1g',
                worker_init='export PARSL_GB_MEMORY=4; export PARSL_CORES=4',
                exclusive=False
            ),
        )
    ],
)
```
annawoodard added a commit that referenced this issue Aug 22, 2019
This commit adds support for respecting the specification of how much
CPU and memory should be used by workers via the PARSL_CORES and
PARSL_GB_MEMORY environment variables (as opposed to inferring it to be
the full node once a job has started on the node). After this commit,
the config below will result in 4 workers being started, using a total
of 4 cores and 12 GB of memory.

This is a pre-requisite to a full solution to #942.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                cmd_timeout=60,
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                scheduler_options='#SBATCH --cpus-per-task=4 --mem-per-cpu=3g',
                worker_init='export PARSL_GB_MEMORY=$(expr $SLURM_CPUS_ON_NODE \* $SLURM_MEM_PER_CPU); export PARSL_CORES=$SLURM_CPUS_ON_NODE',
                exclusive=False
            ),
        )
    ],
)
```
annawoodard added a commit that referenced this issue Aug 23, 2019
* Support cpu and mem specification via env

This commit adds support for respecting the specification of how much
CPU and memory should be used by workers via the PARSL_CORES and
PARSL_GB_MEMORY environment variables (as opposed to inferring it to be
the full node once a job has started on the node). After this commit,
the config below will result in 4 workers being started, using a total
of 4 cores and 12 GB of memory.

This is a pre-requisite to a full solution to #942.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                cmd_timeout=60,
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                scheduler_options='#SBATCH --cpus-per-task=4 --mem-per-cpu=3g',
                worker_init='export PARSL_MEMORY_GB=$(expr $SLURM_CPUS_ON_NODE \* $SLURM_MEM_PER_CPU); export PARSL_CORES=$SLURM_CPUS_ON_NODE',
                exclusive=False
            ),
        )
    ],
)
```

* Change PARSL_GB_MEMORY -> PARSL_MEMORY_GB

* Revert changes to launchers

This reverts the launcher changes introduced in 18061d2.
annawoodard added a commit that referenced this issue Aug 23, 2019
This commit adds the `cores_per_node` and `mem_per_node` keyword args to
the SlurmProvider. These default to None, and behavior is not modified
in the default case. Setting either has three effects. First, it modifies
the Slurm submit script to request the appropriate cores and/or memory.
Second, it sets the environment variables `PARSL_MEMORY_GB` and
`PARSL_CORES` on the node. Finally, the HighThroughputExecutor is modified to
respect those environment variables if they are set, instead of assuming
it has the entire node available for starting workers. An example
configuration, tested on Midway, is provided below. This configuration
requests 4 1-core workers, each with 3 GB of memory.

Partially addresses #942.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                mem_per_node=12,
                cores_per_node=4,
                exclusive=False
            ),
        )
    ],
)
```
annawoodard added a commit that referenced this issue Aug 23, 2019
Partially addresses #942.

This commit adds the `cores_per_node` and `mem_per_node` keyword args to
the SlurmProvider. These default to None, and behavior is not modified
in the default case. Setting either has three effects. First, it
modifies the Slurm submit script to request the appropriate cores and/or
memory.  Second, it sets the environment variables `PARSL_MEMORY_GB` and
`PARSL_CORES` on the node. Finally, the `workers_per_node` attribute is
added to the `HighThroughputExecutor` which will be calculated according
to the resource hints, if they are available. This is read by the
strategy piece, enabling a more accurate calculation for scaling
resources up and down. An example configuration, tested on Midway, is
provided below. This configuration requests 4 1-core workers, each with
3 GB of memory.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                mem_per_node=12,
                cores_per_node=4,
                exclusive=False
            ),
        )
    ],
)
```
annawoodard added a commit that referenced this issue Aug 27, 2019
* Implement resource hints for the SlurmProvider

Partially addresses #942.

This commit adds the `cores_per_node` and `mem_per_node` keyword args to
the SlurmProvider. These default to None, and behavior is not modified
in the default case. Setting either has three effects. First, it
modifies the Slurm submit script to request the appropriate cores and/or
memory.  Second, it sets the environment variables `PARSL_MEMORY_GB` and
`PARSL_CORES` on the node. Finally, the `workers_per_node` attribute is
added to the `HighThroughputExecutor` which will be calculated according
to the resource hints, if they are available. This is read by the
strategy piece, enabling a more accurate calculation for scaling
resources up and down. An example configuration, tested on Midway, is
provided below. This configuration requests 4 1-core workers, each with
3 GB of memory.

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                mem_per_node=12,
                cores_per_node=4,
                exclusive=False
            ),
        )
    ],
)
```

* Add default mem_per_node and cores_per_node

* Switch to properties in base class

Also: clarify docstrings.

* Fix flake8

Also: fix incomplete conversion to property.

* Fix setter definition

@annawoodard
Collaborator

@Lnaden, @dgasmith

This has been implemented for the SlurmProvider; could you let us know if it addresses your concerns? The main changes are that 1) you can now meaningfully request less than a full node, and 2) if you set mem_per_node and/or cores_per_node, Parsl will use this information to calculate how many workers fit on a node, so it can make a more intelligent guess when scaling up about how many resources it needs (instead of assuming the worst-case scenario, that it will only be able to run one worker per node).

This hasn't made it into a release yet, but you can install the latest with: pip install git+https://github.com/parsl/parsl

Here's a config I tested with:

```
from parsl.config import Config
from parsl.providers import SlurmProvider
from parsl.addresses import address_by_hostname
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=1,
            mem_per_worker=3,
            address=address_by_hostname(),
            provider=SlurmProvider(
                'broadwl',
                nodes_per_block=1,
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
                mem_per_node=12,
                cores_per_node=4,
                exclusive=False
            ),
        )
    ],
)
```
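For reference, the workers-per-node calculation for the config above works out as follows (a minimal sketch of the arithmetic; the actual implementation lives in the HighThroughputExecutor and may differ in detail):

```
# Values taken from the config above.
cores_per_node = 4     # SlurmProvider(cores_per_node=4)
mem_per_node = 12      # GB, SlurmProvider(mem_per_node=12)
cores_per_worker = 1   # HighThroughputExecutor(cores_per_worker=1)
mem_per_worker = 3     # GB, HighThroughputExecutor(mem_per_worker=3)

# A worker has to fit within both the core budget and the memory budget.
workers_per_node = min(cores_per_node // cores_per_worker,
                       mem_per_node // mem_per_worker)
print(workers_per_node)  # 4
```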

@Lnaden

Lnaden commented Aug 27, 2019

I'll be able to test this tomorrow! Is this tech going to go into the other providers as well such as the PBSProvider and the LSFProvider?

@annawoodard
Collaborator

@Lnaden Sure thing, we can add it to PBSProvider and LSFProvider. Let's wait to find out if it is working as you expect on Slurm first, then we'll move on to the remaining providers.

@Lnaden

Lnaden commented Aug 27, 2019

Perfectly fine with that. This should be awesome if working correctly. I'll give it a test tomorrow and get back to you!

@Lnaden

Lnaden commented Aug 28, 2019

Awesome. This appears to be working as intended so far. Exclusive mode still works (it requests all the CPUs) but does restrict memory as expected. This allows me to queue up multiple blocks on the same node, and for our use case our blocks only ever stretch across one node anyway.

From our use case's perspective, this appears to be working the way we need it to. I have not tested oversubscribing (cores per worker > cores_per_node * nodes_per_block), as that's not one of our use cases (and we block users from doing so anyway). I have also not tested multi-node jobs, since we don't need those at the moment either. I can do more extensive testing, but it will take some additional time.

The only bug I found is that mem_per_node accepts floats as well as ints, but SLURM complains about float values with an error like `error: invalid memory constraint: 32.0`; forcing an int fixes that.

Overall, this appears to be exactly what we needed! Great work!

@Lnaden

Lnaden commented Aug 28, 2019

I should also clarify that non-exclusive mode works just fine as well, which is the point.

@annawoodard
Collaborator

@Lnaden great, thanks for the bug report and the testing! Which other providers are your highest priority for getting this into first?

@Lnaden

Lnaden commented Aug 28, 2019

If I had to order them by priority for this implementation:

  1. TorqueProvider, as this is also the one we use for PBS and MOAB queues and the other most common one beyond SLURM.
  2. LSFProvider, as we have numerous users for this.
  3. SGEProvider: this is our least populous user base, but we still have some.

I do know that because we use the TorqueProvider for Torque, PBS, and MOAB queues, there might be some oddities which are hard to engineer for. However, we have used the TorqueProvider interchangeably thus far without issue. The only mismatch was one incident on the Titan supercomputer (now decommissioned) where ppn was not an accepted field, but that was because it used whole-node allocations anyway.

@annawoodard
Collaborator

Thanks @Lnaden! Note that we have a relatively new PBSProProvider, which I think fixes the ppn issue. I see we forgot to add it to the documentation and have opened #1235, which adds it. We'll let you know when we have other providers ready for testing.

@Lnaden

Lnaden commented Aug 28, 2019

Great to hear. I'll be able to test the first two on my own when they are available; I'll have to enlist some help for the SGEProvider.

> relatively new PBSProProvider which I think fixes the ppn issue

Good to know. When that comes up, I'll be able to test it more. Do you happen to know if there is a quick command to check which flavor of PBS/Torque a user has installed for future reference?
