Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: partial cancel #16

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

wip: partial cancel #16

wants to merge 4 commits into from

Conversation

vsoch
Copy link
Member

@vsoch vsoch commented Jan 31, 2025

The partial cancel request seems to work, but later when I try to do Info or Cancel (the regular cancel) for the job I get an error (I assume it is because the jobid does not exist, but because of the bug with the error messages not passing forward fixed in the grow-api PR to flux-sched I cannot see any further output for it).

@milroy the tests are expected to fail here because of my note above, and I think I can't see the error because of the issue fixed with grow-api. For some context, the partial cancel fluxcli function here is going to go into the fluxion service here so that fluxqueue (updated code not pushed yet) can use it. I can explain better in a meeting, but high level what we do is the following:

  1. Before a pod or job is unsuspended / ungated, we annotate it with the jobid, the node assigned and the core paths assigned.
  2. I first tested generating and adding the cancel request to the pod as a label. That was a terrible idea given how large the graph can get, and for Job you need to combine them all into one.
  3. I decided a better idea was to have this operator performed by the partial cancel endpoint provided by fluxion as a service, the reason because it already stores the entire graph, and the partial graph can be assembled from the node paths alone (this was actually quite hard, I did it 4 different ways, one with recursion, and it took me all day).
  4. When a pod is terminated, we discover the node paths, and can issue the partial cancel request to fluxion with them.
  5. Fluxion will generate the jgf, issue the partial cancel, and we hope it works!

It's not working at the moment, at least for a test here that I think should work, but it's also getting late so maybe I'm not working. Let's discuss at next meeting. I think next I'm going to try building with your grow-api PR branch to see if there are other errors I can see.

The partial cancel request seems to work, but later when
I try to do Info or Cancel (the regular cancel) for the
job I get an error (I assume the job does not exist,
but because of the bug with the error messages not passing
forward I cannot see any further output for it).

Signed-off-by: vsoch <[email protected]>
@vsoch
Copy link
Member Author

vsoch commented Jan 31, 2025

OK I went ahead and built with the grow-api branch, and made sure I had a fresh (not used yet) graph, and first I'm testing creating an allocation and then issuing a partial cancel with the entire graph (which I think should work)?

Match allocate for 2 nodes (all of graph resources)
        ----Match Allocate output---
jobid: 1
reserved: false
allocated: {"graph": {"nodes": [{"id": "43", "metadata": {"type": "core", "id": 35, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node0/socket1/core35"}}}, {"id": "5", "metadata": {"type": "socket", "id": 1, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node0/socket1"}}}, {"id": "2", "metadata": {"type": "node", "id": 0, "rank": -1, "paths": {"containment": "/tiny0/rack0/node0"}}}, {"id": "79", "metadata": {"type": "core", "id": 35, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "id": 1, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "id": 1, "rank": -1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "id": 0, "rank": -1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "id": 0, "rank": -1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "5", "target": "43"}, {"source": "2", "target": "5"}, {"source": "1", "target": "2"}, {"source": "7", "target": "79"}, {"source": "3", "target": "7"}, {"source": "1", "target": "3"}, {"source": "0", "target": "1"}]}}

at: 0
error: <nil>
remove: partial_cancel returned error.
cancel: ERROR: error encountered while removing job 1

2025/01/31 06:45:50 Error in ReapiClient PartialCancel: issue resource api client partial cancel -1
panic: Error in ReapiClient PartialCancel: issue resource api client partial cancel -1


goroutine 1 [running]:
log.Panicf({0x4c1d42?, 0xc000118008?}, {0xc00015fd50?, 0x1?, 0x1?})
        /usr/local/go/src/log/log.go:439 +0x65
main.main()
        /workspaces/fluxion-go/cmd/test/test.go:146 +0x12cf
make: *** [Makefile:32: test-binary] Error 2

That has the grow-api errors added, and the error there is

cancel: ERROR: error encountered while removing job 1

Which isn't super informative. If I change that to a partial jgf (e.g., remove a path to one of the cores) I get the same response.

Oh interesting! If I just do a cancel, the cancel works, and then the partial cancel works, I assume because it does nothing. But that is strange because the result is the same regardless of how I set noent_ok. I am probably doing something wrong, so will call it a night.

@vsoch vsoch force-pushed the add-partial-cancel branch 2 times, most recently from 2c7b7b8 to f59eda2 Compare February 5, 2025 05:26
Problem: the current testing is not standard for Go,
and makes it hard to understand or run in units.
Solution: move testing into proper test alongside
package, and break apart testing for cancel and match.
Additionally, we are using the same graphs / jobspecs
from flux-sched.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the add-partial-cancel branch 3 times, most recently from ba13d39 to 88050db Compare February 8, 2025 05:34
@vsoch vsoch force-pushed the add-partial-cancel branch from 88050db to 776aac8 Compare February 8, 2025 05:35
Problem: older bases do not have new enough
gcc to build flux-sched.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the add-partial-cancel branch from fc7a62d to 30b3b2f Compare February 8, 2025 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant