-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wip: partial cancel #16
base: main
Are you sure you want to change the base?
Conversation
The partial cancel request seems to work, but later when I try to do Info or Cancel (the regular cancel) for the job I get an error (I assume the job does not exist, but because of the bug with the error messages not passing forward I cannot see any further output for it). Signed-off-by: vsoch <[email protected]>
OK I went ahead and built with the grow-api branch, and made sure I had a fresh (not used yet) graph, and first I'm testing creating an allocation and then issuing a partial cancel with the entire graph (which I think should work)? Match allocate for 2 nodes (all of graph resources)
----Match Allocate output---
jobid: 1
reserved: false
allocated: {"graph": {"nodes": [{"id": "43", "metadata": {"type": "core", "id": 35, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node0/socket1/core35"}}}, {"id": "5", "metadata": {"type": "socket", "id": 1, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node0/socket1"}}}, {"id": "2", "metadata": {"type": "node", "id": 0, "rank": -1, "paths": {"containment": "/tiny0/rack0/node0"}}}, {"id": "79", "metadata": {"type": "core", "id": 35, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "id": 1, "rank": -1, "exclusive": true, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "id": 1, "rank": -1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "id": 0, "rank": -1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "id": 0, "rank": -1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "5", "target": "43"}, {"source": "2", "target": "5"}, {"source": "1", "target": "2"}, {"source": "7", "target": "79"}, {"source": "3", "target": "7"}, {"source": "1", "target": "3"}, {"source": "0", "target": "1"}]}}
at: 0
error: <nil>
remove: partial_cancel returned error.
cancel: ERROR: error encountered while removing job 1
2025/01/31 06:45:50 Error in ReapiClient PartialCancel: issue resource api client partial cancel -1
panic: Error in ReapiClient PartialCancel: issue resource api client partial cancel -1
goroutine 1 [running]:
log.Panicf({0x4c1d42?, 0xc000118008?}, {0xc00015fd50?, 0x1?, 0x1?})
/usr/local/go/src/log/log.go:439 +0x65
main.main()
/workspaces/fluxion-go/cmd/test/test.go:146 +0x12cf
make: *** [Makefile:32: test-binary] Error 2 That has the grow-api errors added, and the error there is cancel: ERROR: error encountered while removing job 1 Which isn't super informative. If I change that to a partial jgf (e.g., remove a path to one of the cores) I get the same response. Oh interesting! If I just do a cancel, the cancel works, and then the partial cancel works, I assume because it does nothing. But that is strange because the result is the same regardless of how I set noent_ok. I am probably doing something wrong, so will call it a night. |
2c7b7b8
to
f59eda2
Compare
Problem: the current testing is not standard for Go, and makes it hard to understand or run in units. Solution: move testing into proper test alongside package, and break apart testing for cancel and match. Additionally, we are using the same graphs / jobspecs from flux-sched. Signed-off-by: vsoch <[email protected]>
ba13d39
to
88050db
Compare
Signed-off-by: vsoch <[email protected]>
88050db
to
776aac8
Compare
Problem: older bases do not have new enough gcc to build flux-sched. Signed-off-by: vsoch <[email protected]>
fc7a62d
to
30b3b2f
Compare
The partial cancel request seems to work, but later when I try to do Info or Cancel (the regular cancel) for the job I get an error (I assume it is because the jobid does not exist, but because of the bug with the error messages not passing forward fixed in the grow-api PR to flux-sched I cannot see any further output for it).
@milroy the tests are expected to fail here because of my note above, and I think I can't see the error because of the issue fixed with grow-api. For some context, the partial cancel fluxcli function here is going to go into the fluxion service here so that fluxqueue (updated code not pushed yet) can use it. I can explain better in a meeting, but high level what we do is the following:
It's not working at the moment, at least for a test here that I think should work, but it's also getting late so maybe I'm not working. Let's discuss at next meeting. I think next I'm going to try building with your grow-api PR branch to see if there are other errors I can see.