-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add execution order controls #424
base: develop
Are you sure you want to change the base?
Conversation
docs/Maestro/specification.md
Outdated
|
||
| **Key** | **Required** | **Type** | **Description** | | ||
| :- | :-: | :-: | :- | | ||
| `step_order` | No | str | Type of scheduler managing execution. One of: {`depth-first`, `breadth-first`}. Default: `depth-first`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default here be breadth-first like you say in the sentence above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's definitely a typo. should be fixed now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this addition a lot. Two of my comments here are just suggestions for cleaning up code/planning for the future but if you want to ignore them I'm not going to be offended at all.
if self.weight < other.weight: | ||
return True | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could get fancy here (and for the __gt__
method below) and make this a one-liner: return self.weight < other.weight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add onto this, I think you're only required to implement __lt__
and __eq__
since you can build the other comparison methods from them. But I'm only 85% sure from memory 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll throw another wrench into this and point out this implementation is probably not what we want despite it working for the current feature. I.e. step weights can't be the only basis of comparison. The __eq__
gets used for doing step equality, which means we can't use __lt__
to construct that if we stick with this __lt__
that only cares about the step weights. Had kind of punted on what other things to care about in these during the initial implementation, but should definitely be addressed now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we leave this as is and then set the weights when we determine priorities?
EDIT: this comment is in reply to the previous discussion started here: #424 (comment)
for weight, (parent, step_name, step) in enumerate(self.walk_study()): | ||
pprint(step) | ||
pprint("Updating weight") | ||
if isinstance(step, StudyStep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there ever a case here where the step is not a StudyStep? Would it just be None?
self.step_prioritization_factory.register_priority_expr( | ||
'step_order', | ||
step_weight_priority_df | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely certain on your future plans for this section of code, but if you're planning on adding more additions to this if/else chain of the same form then maybe it would be better in the long run if you did something like so:
priority_expr_mapper = {
"breadth-first": {
"name": "step-order",
"func": step_weight_priority_bf
},
"depth-first": {
"name": "step-order",
"func": step_weight_priority_df
}
}
self.step_prioritization_factory.register_priority_expr(
priority_expr_mapper[self.step_order]["name"],
priority_expr_mapper[self.step_order]["func"]
)
Then any changes to the register_priority_expr()
method would only have to be modified in one function call rather than multiple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option here is to have some protocoled classes that implement the __call__
method -- then you can just point to the class you want it to use and have them all be callable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, both good ideas, and I think the end game is some mix of the two to get all the work out of the constructor. Haven't quite fleshed out the design for the implementation, but idea with this is to eventually enable users to build expressions from things like step resource keys (procs, walltime, combinations of them), parameters, maybe even step names and layer them on to this priority. Priorities then being the sequential eval of all of those expressions (step order being the only one that's always there) and exploiting the sorting mechanism of tuples to give a lot of fine grained control over the orders.
Of course, if there are other ideas on things we might want to use to control this ordering, let me know!
@@ -0,0 +1,93 @@ | |||
description: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to have an example in the samples directory as well.
@@ -906,12 +920,32 @@ def execute_ready_steps(self): | |||
# Now, we need to take the min of the length of the queue and the | |||
# computed number of slots. We could have free slots, but have less | |||
# in the queue. | |||
_available = min(_available, len(self.ready_steps)) | |||
# _available = min(_available, len(self.ready_steps)) # Kind of inconsistent if it's affected by queue size sometimes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure what you mean by inconsistent here. This prevents the need for the conditional below (at least that's what it's intent was when it was originally coded. It means not checking for the -1
value explicitly also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem with this is PriorityQueue's are like standard Queue's and do not implement a length like the dequeue that used to live here. The size is also only approximate (unsure if this is really only the case in multithreading cases or not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this whole block can now be removed as we don't need this check -> the if queue.empty() test on the while loop accomplishes the same behavior and exits when there are no more steps are available
new_steps_running = 0 | ||
LOGGER.debug("Available slots: %d, Ready steps is empty %s", | ||
_available, self.ready_steps.empty() == True) | ||
while not self.ready_steps.empty() and (new_steps_running < _available or _available < 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the original intent of the throttle that was added. The --throttle
option meant that Maestro would not exceed that number of jobs in total. This form of throttle says that we are allowed to launch that number of new steps instead of being a global limit to my understanding. Am I incorrect there?
Also, this conditional is a little convoluted -- since available
could be -1
it adds the extra conditional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this part needs another data structure to really work right again: queue's size method is only approximate, so isn't reliable for this like the previous implementation was. will need a running steps tracker alongside this to make this throttling less convoluted
Edit: new_steps_running just needs to not be reset every run through i think and then we'll have the intended behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, this actually does work as intended, no need to do my edit. The controls come from _available
getting set using throttle and the number of running steps a little above this loop. So if in_progress
is already = throttle this conditional will not let any new jobs through.
just tested it on one of the hello world specs and verified it would properly limit to 1 job at a time (throttle setting) with the slurm adapter
Will add a comment to explain the variable.
if self.weight < other.weight: | ||
return True | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add onto this, I think you're only required to implement __lt__
and __eq__
since you can build the other comparison methods from them. But I'm only 85% sure from memory 😅
def __init__(self, name, description, | ||
studyenv=None, parameters=None, steps=None, out_path="./"): | ||
def __init__(self, name, description, studyenv=None, | ||
parameters=None, steps=None, out_path="./"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're passing the study environment here, the output path might be redundant. Could be an opportunity to clean up some parameters here.
maestrowf/datastructures/dag.py
Outdated
@@ -127,6 +127,35 @@ def remove_edge(self, src, dest): | |||
logging.debug("Removing edge (%s, %s).", src, dest) | |||
self.adjacency_table[src].remove(dest) | |||
|
|||
def __iter__(self): | |||
""" | |||
Iterator over the graph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring could be clearer. It also looks like this is a DFS search. It might be worth mentioning that in the docstring or making it configurable somehow.
self.step_prioritization_factory.register_priority_expr( | ||
'step_order', | ||
step_weight_priority_df | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option here is to have some protocoled classes that implement the __call__
method -- then you can just point to the class you want it to use and have them all be callable.
@jwhite242 -- Is this a dead end at this point? We're assessing Maestro and it's looking like DFS would be good on our end too. Wondering just in case we need to revisit. |
a1f3a35
to
1513ef1
Compare
No, i just got derailed by other things for a bit. Am ramping back up on this now. Having the more general expression based priorities are going to be pretty helpful -> major use case here being getting big/long running variants of steps running sooner, allowing smaller ones to be churned through within the throttle limit alongside it for improved throughput. Thinking more on the protocol question.. I'm on the fence on whether we shouldn't just use abstract base classes and tie info to these things; i.e. per step overrides of expressions. But will play with both and see how they feel |
Fix validator to handle errors at block roots
@FrankD412, @bgunnar5, @jsemler |
actually, just had another thought that might fit nicer, expanding it and making the value more of a 'oneOf' type, so either value or expression, making it more clear that there's two types and avoiding having to do greedy parsing on things to figure it out on our end
|
warning to highlight interaction with throttle setting. Swap to single hued blue color scheme
Continued tweaking/iteration with a mind toward this being amenable to a mix of built-in/user things (think plugins for reusable expressions checkable via the dependencies machinery)
Think of the prioritizer_id (or similar name) as akin to the key used to id script adapters, so we can use this to tag plugin installed things in a way that makes error messaging helpful since it's a standardized place to register these functions. So for the sharing, maestro can tell the recipient that they're missing this plugin somebody was using. Still like the idea of descriptions too, though not sure about making them mandatory given the built-in ones can have that set internally and just serialized to the spec in the workspace |
- name: extract-data | ||
description: | | ||
Extract data and upload to external source, then clean up workspace | ||
run: | ||
cmd: | | ||
# Placeholder for some database upload | ||
python $(UPLOADER) -i $(echo-params.workspace) | ||
|
||
upload_success=$? # Capture retcode of uploader for use later | ||
# Clean up workspace to save space | ||
if [[ upload_success == 0 ]] then | ||
echo 'Data upload successful' | ||
echo 'Cleaning up data files' | ||
rm $(echo-params.workspace)/datafile.txt | ||
else | ||
echo 'Data upload failed' | ||
exit upload_success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this step depend on echo-params
?
clear. | ||
|
||
|
||
These execution orders can be shown with the sample hello-bye world specification as shown in the tutorials: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to Tutorials page here would be helpful
run: | ||
cmd: | | ||
echo "$(FAREWELL), $(NAME)!" > $(BYE_FORMAT) | ||
depends: [say-hello] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this depends statement is changed to depends: [say_hello_*]
can depth-first execution still take place? If so, should this be changed?
|
||
=== "Breadth-first order, throttle=1" | ||
|
||
Breadth first, order of excution marked by colors with lighter colors executing first: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo here, should be: "order of execution". This same typo is in the DFO tab below as well
""" | ||
Defines api for Priority Expressions for study steps. | ||
""" | ||
def __call__(self, study_step): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now the type hint could be study_step: "StudyStep"
that way there's at least something until the circular import is resolved. Up to you though
if self.weight < other.weight: | ||
return True | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we leave this as is and then set the weights when we determine priorities?
EDIT: this comment is in reply to the previous discussion started here: #424 (comment)
@@ -317,12 +317,17 @@ def run_study(args): | |||
batch = spec.batch | |||
if "type" not in batch: | |||
batch["type"] = "local" | |||
|
|||
# Check the execution block early, then store it | |||
exec_block = spec.execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if no execution block is given, is there a default that needs to be set? Something like:
if "graph-order" not in exec_block:
exec_block["graph-order"] = "breadth-first"
assert step.weight == expected_step_weights[step_name] | ||
|
||
|
||
# NOTE: can we refactor to use a shared study fixture instead of redoing it frequently? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can have a single hello_bye.yaml script and then write fixtures to update certain blocks of the script? In the fixture you can yield control over to whatever test/other fixture needs this then after the yield you can do cleanup to reset the yaml to what it was prior to running whatever test/fixture changed it.
I'm doing something similar with Merlin's CONFIG object on a branch where I'm working on incorporating pytest/coverage into Merlin's test suite: https://github.com/bgunnar5/merlin/blob/feature/coverage/tests/conftest.py#L230. This is definitely subject to change as it's still a WIP but just showcasing an example of what I'm thinking.
self.step_prioritizer = StepPrioritizer() | ||
|
||
for expr in self.exec_list: | ||
# note: likely be nicer to make these actual objects during spec parsing.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this note
pprint(f"Parameterized step name: {parameterized_step_name}, step: {parameterized_step}") | ||
if not parameterized_step: # catch source, which is always none | ||
continue | ||
# NOTE: really can't get the base step name anymore after staging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into this same problem when developing Merlin's status commands so if we could implement an easy way to access this post-staging that would be amazing
Any chance of moving this along? I would love to have DFS implemented for our use case. |
Add machinery for controlling execution order/priority of study steps