Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix subsystems #1041

Closed
wants to merge 1 commit into from
Closed

fix subsystems #1041

wants to merge 1 commit into from

Conversation

trws
Copy link
Member

@trws trws commented Jun 26, 2023

This is a first cut at fixing multi-subsystem traversals. When I tried one earlier today, resource-query segfaulted trying to access an uninitialized shared_ptr. I'm not sure how the aux traversals got to this point, but it looks like they're pretty thoroughly broken right now. Fixing just the segfault puts it into infinite recursion and hits a stack overflow, so added vertex coloring also.

Any feedback on this, or thoughts on what we can do here would be appreciated. Especially the coloring I'm not sure is actually appropriate since a node colored exploring from one point might need to be explored from another point in the containment tree to find a proper match, but without it the rest of the design fails pretty spectacularly.

Has anyone done recent work with non-containment subsystems?

@codecov
Copy link

codecov bot commented Jun 26, 2023

Codecov Report

Merging #1041 (2cf181d) into master (32f74d6) will decrease coverage by 0.1%.
The diff coverage is 0.0%.

❗ Current head 2cf181d differs from pull request most recent head 6cf07f0. Consider uploading reports for the commit 6cf07f0 to get more accurate results

@@           Coverage Diff            @@
##           master   #1041     +/-   ##
========================================
- Coverage    74.4%   74.3%   -0.1%     
========================================
  Files          86      86             
  Lines        9434    9436      +2     
========================================
- Hits         7021    7017      -4     
- Misses       2413    2419      +6     
Impacted Files Coverage Δ
resource/traversers/dfu_impl.cpp 84.7% <0.0%> (-0.3%) ⬇️

... and 1 file with indirect coverage changes

@milroy
Copy link
Member

milroy commented Jun 27, 2023

Has anyone done recent work with non-containment subsystems?

No recent work, and I'm not sure when/whether the non-containment subsystems worked.

Does the commit fix the observed behavior and you're mainly concerned about side-effects of coloring non-containment vertices? I'll dig into the graph coloring question later this week.

@trws
Copy link
Member Author

trws commented Jun 27, 2023

It makes it so a non-containment completes, though the post order of the aux subsystems pops some errors, they aren't fatal. It's mainly that I'm not sure the coloring gives us the behavior we want for non-containment subsystems.

@tpatki
Copy link
Member

tpatki commented Jul 12, 2023

Need to think a bit more about this from the perspective of coloring or tracking for multi-subsystems. Will take a look in the coming week.

The non-containment subsystems of power and IO work correctly when tested independently (power test case. I am fairly sure the multi-subsystem case didn't work and wasn't tested thoroughly. I recall testing with -S CA+P at some point during the var-aware development work and it failing, and us discussing that we need to add more support for it (I don't think we ever added that).

How are you testing the multi-subsystem case @trws? I don't see a test-case for it for (unless I missed it), I can help debug and test this too.

@trws
Copy link
Member Author

trws commented Jul 12, 2023

I was using the mini-5subsystems-fine.graphml file under the utilities as an input, and ensuring that query could at least run to completion against it with multiple subsystems active. It's still not right, the coloring on auxiliary subsystems means the same vertex can't be matched from multiple sources, which is clearly wrong, but without something to serve as a termination condition the current version of the traversal hangs.

My current thought is that we should try to nail down what the actual traversal behavior is for the "upwalk" since I can't seem to find anything that documents how that's supposed to work. As it was implemented (pre-coloring) the traverser was depth-first, walking all containment edges first, and walking all selected subsystems in subsystem order thereafter. Using coloring for the depth-first portion is correct, though not actually necessary, but what should the behavior be for the other subsystems? I think we need an actual policy, and I'm not sure what that should be. It could be limiting the number of steps away from the source vertex where the subsystem was entered, it could be based on what's in the request, honestly I'm not sure yet.

The main thing I really want to do to help reason about some of this, and make things more efficient, is actually break up the edges in the multigraph by subsystem. We can either do this with the boost::graph "subgraph" feature, so we can treat each subsystem as an actual graph object, or I can throw together a container we can use as the storage for the graph that lets us efficiently access only the edges for one subsystem (probably just a vector with a side-table for offsets based on the subsystem ID). If we have that division, then the DFS can operate directly on the containment tree, and wont need coloring, filtering, or dynamic checks because it's guaranteed to be a tree. Then we can use custom logic only for the others where we need it. What do you think?

@milroy
Copy link
Member

milroy commented Jul 13, 2023

I recall testing with -S CA+P at some point during the var-aware development work and it failing, and us discussing that we need to add more support for it (I don't think we ever added that).

I just tested -S C+PA as part of the investigation of PR #1046 and can confirm it does not work. I did it by testing one of the power-aware jobspecs against the power JGF with -S C+PA and -S PA:

$ ./resource/utilities/resource-query -L t/data/resource/jgfs/power.json -f jgf -S PA -P high
INFO: Loading a matcher: PA
resource-query> m allocate t/data/resource/jobspecs/power/test002.yaml
      ------------power0[10:x]
      ------------power1[10:x]
      ---------node53[1:x]
[...]
$ ./resource/utilities/resource-query -L t/data/resource/jgfs/power.json -f jgf -S C+PA -P high
INFO: Loading a matcher: C+PA
resource-query> m allocate t/data/resource/jobspecs/power/test002.yaml
INFO: =============================
INFO: No matching resources found
INFO: JOBID=1
INFO: =============================

I think this is a byproduct of the graph coloring as @trws mentioned in the first comment.

but what should the behavior be for the other subsystems? I think we need an actual policy, and I'm not sure what that should be. It could be limiting the number of steps away from the source vertex where the subsystem was entered, it could be based on what's in the request, honestly I'm not sure yet.

I agree but don't immediately know how to approach this, either. I'll give it some thought.

@trws
Copy link
Member Author

trws commented Jul 13, 2023 via email

@milroy
Copy link
Member

milroy commented Jul 13, 2023

Updating my comment above with a clearer explanation of one item.

Using coloring for the depth-first portion is correct, though not actually necessary

While coloring isn't necessary, there needs to be some capability for breaking cycles in the multigraph given the edge multiplicity between neighbors varies and is used for matching:

// qualifed slot count is determined by the most constrained resource type

Tom addressed that item in his comment immediately above (which posted between my edits of my comment). Sorry for the confusion.

@trws trws closed this Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants