Updates to use pyaestro to back a local pool style adapter. #345
base: develop
Conversation
@jwhite242 -- Some considerations based on what we were discussing earlier:

Current todo: Cancellation works at the Maestro level, but leaves the pool executing the last processes that were running before the cancellation was posted. This bug traces back to pyaestro.
""" | ||
return f"#!{self._exec}" | ||
|
||
def get_parallelize_command(self, procs, nodes=None, **kwargs): |
Should we generalize this to an execute or launch command rather than making parallel special? It's really just a hook for a custom launch prefix/suffix/wrapper/etc. rather than being about parallelism in particular, right? One scheduler case I can see is big-memory jobs on HPC that request a whole node just for the memory but only run serial (e.g. srun -N1 -n1 ...).
Yeah -- I want to redesign this in pyaestro and then use it here.
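For illustration only, here is a minimal sketch of what a generalized launch hook could look like; the `LaunchSpec` container and `get_launch_command` name are hypothetical, not existing Maestro or pyaestro API.

```python
# Hypothetical generalization of get_parallelize_command into one launch hook;
# all names below are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchSpec:
    """Illustrative container for a step's launch parameters."""
    procs: int = 1
    nodes: Optional[int] = None
    prefix: str = ""   # e.g. "srun -N1 -n1" for a big-memory job that runs serial
    suffix: str = ""


class AdapterSketch:
    """Stand-in for a script adapter; only the proposed hook is shown."""

    def get_launch_command(self, cmd, spec, **kwargs):
        """Wrap a step command with an optional custom prefix/suffix."""
        parts = [part for part in (spec.prefix, cmd, spec.suffix) if part]
        return " ".join(parts)
```

With this shape, `AdapterSketch().get_launch_command("hostname", LaunchSpec(prefix="srun -N1 -n1"))` would return `"srun -N1 -n1 hostname"`, while a plain serial step passes through untouched.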
    identifier.
    """
    try:
        job_id = self._pool.submit(path, cwd)
Is there a future being tracked in here somewhere?
This pool is entirely backed by pyaestro -- which does implement this with a future.
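As a rough standard-library sketch of that shape -- `submit()` hands back an identifier while the Future stays inside the pool -- assuming pyaestro's actual `Executor` differs in detail:

```python
# Standard-library sketch of a submit() that returns a job id and keeps the
# Future internally; this mirrors the idea, not pyaestro's implementation.
import subprocess
import uuid
from concurrent.futures import ProcessPoolExecutor


class PoolSketch:
    def __init__(self, num_workers=2):
        self._pool = ProcessPoolExecutor(max_workers=num_workers)
        self._futures = {}  # job identifier -> Future

    def submit(self, path, cwd):
        """Launch a script and return its job identifier."""
        job_id = str(uuid.uuid4())
        self._futures[job_id] = self._pool.submit(
            subprocess.run, ["/bin/bash", path], cwd=cwd
        )
        return job_id

    def is_done(self, job_id):
        """The tracked Future answers later status queries."""
        return self._futures[job_id].done()
```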
    return JobStatusCode.ERROR, {}

status = {jid: State.UNKNOWN for jid in joblist}
try:
Would it be better to put this inside the loop so it can actually get through the whole list and catch all the exceptions? Also, what exception did this get added to catch?
I did this outside the loop since it's a basic operation -- also, because the Enum is not a constructor, I couldn't use defaultdict like I wanted. I plan to explore making a meta-enum to handle the default case to make this more Pythonic.
In regard to the except -- it's there to log any possible errors that do come up. It's not necessarily meant to take corrective action.
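A sketch of the per-job variant being suggested above -- catching inside the loop so one bad identifier does not short-circuit the rest of the list. The `get_status` call and the `State` members here are placeholders, not the real pyaestro/Maestro API.

```python
# Placeholder enum and logger; the per-job get_status() call is hypothetical.
import logging
from enum import Enum, auto

LOGGER = logging.getLogger(__name__)


class State(Enum):
    UNKNOWN = auto()
    RUNNING = auto()
    FINISHED = auto()


def check_jobs(pool, joblist):
    """Query each job separately and log failures without aborting the loop."""
    status = {jid: State.UNKNOWN for jid in joblist}
    for jid in joblist:
        try:
            status[jid] = pool.get_status(jid)
        except Exception as exc:
            LOGGER.error("Failed to query status for %s: %s", jid, exc)
    return status
```

(As an aside, `defaultdict` only needs a zero-argument callable, so `defaultdict(lambda: State.UNKNOWN)` would also cover the default case without a meta-enum.)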
c_return = CancelCode.OK
ret_code = 0
if ExecCancel.FAILED in c_status:
Should this check all of them, i.e. count the cancels that failed and the ones that succeeded separately, for more informative error messages?
This design choice is carried forward from SLURM (where I initially made it); from that perspective, the status is reported if at least one record reports that state. The issue with schedulers like SLURM is that it's more efficient to do a bulk cancel, so there's no way to get a per-job report (not without polling the scheduler to see if the cancel took).
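For reference, the counting approach being suggested can be sketched like this, with stand-in enums in place of the adapter's real `ExecCancel` and `CancelCode`:

```python
# Stand-in enums; the adapter's real ExecCancel/CancelCode members differ.
from collections import Counter
from enum import Enum, auto


class ExecCancel(Enum):
    SUCCESS = auto()
    FAILED = auto()


class CancelCode(Enum):
    OK = auto()
    ERROR = auto()


def summarize_cancels(c_status):
    """Count failed and successful cancels for a more informative message."""
    counts = Counter(c_status)
    failed = counts[ExecCancel.FAILED]
    succeeded = counts[ExecCancel.SUCCESS]
    c_return = CancelCode.ERROR if failed else CancelCode.OK
    ret_code = 1 if failed else 0
    return c_return, ret_code, "%d cancelled, %d failed to cancel" % (succeeded, failed)
```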
elif executor_state == ExecTaskState.INITIALIZED:
    return State.INITIALIZED
elif executor_state == ExecTaskState.CANCELLED:
    return State.CANCELLED
What about a CLOSING or CANCELLING status? I've noticed, at least on SLURM, that if you check again too quickly after cancelling (sometimes up to a minute) you'll get the CG status from the scheduler while it's still cleaning up/shutting down. Or can pending capture this?
In the current implementation, those don't exist. The pool technically will cancel a job (in this case a process) outright.
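For context, the translation under discussion can be sketched as a lookup table; the enum members below are stand-ins for the real pyaestro/Maestro ones, and the comment marks where a transitional CANCELLING entry would go if the pool ever exposed one.

```python
# Stand-in enums; the real ExecTaskState/State members live in pyaestro/Maestro.
from enum import Enum, auto


class ExecTaskState(Enum):
    INITIALIZED = auto()
    RUNNING = auto()
    CANCELLED = auto()


class State(Enum):
    INITIALIZED = auto()
    RUNNING = auto()
    CANCELLED = auto()
    UNKNOWN = auto()


# A transitional CANCELLING mapping could be added here if the pool ever
# reported a cleanup window akin to SLURM's CG status.
_STATE_MAP = {
    ExecTaskState.INITIALIZED: State.INITIALIZED,
    ExecTaskState.RUNNING: State.RUNNING,
    ExecTaskState.CANCELLED: State.CANCELLED,
}


def to_maestro_state(executor_state):
    """Translate an executor task state, defaulting to UNKNOWN."""
    return _STATE_MAP.get(executor_state, State.UNKNOWN)
```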
batch:
    shell: /bin/bash
    type: local_pool
    num_workers: 3
We'll definitely need to spend some effort on whatever nomenclature gets settled on for this stuff, to avoid confusion with workers/cores/tasks/jobs/...
Should I rename this to num_processes instead? I feel like that would be a better description that doesn't alias to other things.
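If the key were renamed, one sketch of how the adapter could accept both spellings during a transition (the function name and dictionary shape here are illustrative, not the adapter's actual config handling):

```python
# Illustrative only: prefer the new key, fall back to the old one.
def pool_size(batch_config, default=1):
    """Prefer 'num_processes', fall back to the older 'num_workers' key."""
    return int(
        batch_config.get("num_processes", batch_config.get("num_workers", default))
    )


batch = {"shell": "/bin/bash", "type": "local_pool", "num_workers": 3}
print(pool_size(batch))  # -> 3
```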
So I was thinking this could be a useful distinction based on the two possible run modes: standalone batch jobs (the current scheduler adapters) and running Maestro inside of an allocation manually for job packing (or spawned by another tool built on top of Maestro). It doesn't necessarily need to use local as the distinction, but do we need some hook to enable the latter job-packing mode?

As for the job packing, this is something where I think plugins would be really useful, or some way for users to write these things like with pgen. The number of scheduling behaviors and the optimization algorithms for implementing them is a pretty large space that doesn't necessarily need to be hard-wired into Maestro. A few simple/interesting ones would be good, of course.
I was thinking about this, and the case of running Maestro in an allocation would be where an "allocation adapter" class comes in handy. That class would take in the global set of resources requested and schedule the conductor call. That's actually where I wanted to split out the MPI-related functionality from the [...].

For this PR, I was thinking I could get the local pool in as the main local adapter (it would still return that it's local, but in the [...]). I'm starting to wonder if this is a good time to introduce an [...].

Hopefully this is making sense.
This PR is a preliminary step towards a more "ensemble"-ish adapter related to the functionality discussed in #330.
This new `LocalPoolAdapter` utilizes pyaestro's `Executor` class in order to fake a scheduler. Some potential future additions are timeouts and resource support for the above #330.
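On the timeout idea mentioned above, a self-contained standard-library sketch of what a per-step timeout could look like; pyaestro's `Executor` would need its own hook for this, so treat the names here as illustrative.

```python
# Sketch of layering a per-step timeout on a pooled job using only the
# standard library; not pyaestro's Executor API.
import subprocess
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=3)
future = pool.submit(subprocess.run, ["/bin/bash", "-c", "sleep 5"], check=False)

try:
    result = future.result(timeout=1)  # raises if the step exceeds its budget
except FutureTimeout:
    # The underlying process keeps running; a real adapter would also need to
    # terminate it (e.g. by tracking the Popen handle and killing it).
    print("step exceeded its timeout")
```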