Allow expiration of running jobs to be adjusted via sched.expiration RPC #1079

Open
Tracked by #4175
grondo opened this issue Sep 25, 2023 · 9 comments · May be fixed by #1158
@grondo
Contributor

grondo commented Sep 25, 2023

RFC 27 now defines a sched.expiration RPC which schedulers may implement to support adjustment to the duration/expiration of a running job.

This RPC should be implemented in Fluxion so that its internal endtime for jobs is synchronized with the job execution system when a job expiration is updated.

@grondo
Contributor Author

grondo commented Sep 28, 2023

Summarizing a conversation from the team meeting: There is no need to reject a sched.expiration update that overlaps with an existing reservation (job or future DAT or downtime reservation). Since this expiration adjustment is considered "administrative", it should be applied to the job in question and the plan discarded. When DAT/downtime reservations are supported, those reservations should be allowed to "overlap" with jobs (the sysadmins will clean up running jobs in this case).
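To make the "apply the update and discard the plan" idea concrete, here is a minimal sketch in Python. It is only a toy model: `ToyPlan`, its fields, and `update_expiration` are hypothetical names invented for illustration and do not correspond to Fluxion's planner API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlan:
    """Hypothetical stand-in for a scheduler's view of time on one resource."""
    # job id -> (start, end) for running jobs
    allocations: dict = field(default_factory=dict)
    # job id -> (start, end) for planned, not yet running, jobs
    reservations: dict = field(default_factory=dict)

    def update_expiration(self, jobid, new_end):
        """Apply an administrative expiration update without an overlap check."""
        start, _ = self.allocations[jobid]  # KeyError if the job is not running
        if new_end <= start:
            raise ValueError("expiration must be after the job's start time")
        self.allocations[jobid] = (start, new_end)
        # Do not reject on overlap: drop the plan so it can be rebuilt later.
        discarded = len(self.reservations)
        self.reservations.clear()
        return discarded


plan = ToyPlan()
plan.allocations[1] = (0, 100)
plan.reservations[2] = (100, 200)   # would overlap once job 1 is extended
discarded = plan.update_expiration(1, 150)
print(plan.allocations[1], discarded)
```

The design choice modeled here is that the update never fails because of a conflicting reservation; the cost is that the schedule has to be recomputed afterwards.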

@grondo
Contributor Author

grondo commented Nov 1, 2023

A WIP PR with flux-core support for updating the duration of running jobs is in flux-framework/flux-core#5522. It contains the implementation of the sched.expiration RPC to request the expiration update from the scheduler. Once that is merged, Fluxion will need support for that RPC, or the job manager update service will go ahead with the expiration update (it assumes the update is valid if the RPC fails with ENOSYS). This PR is planned to be merged before the Nov release.
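The ENOSYS fallback described above can be sketched roughly as follows. This is not the flux-core implementation: `send_rpc` is a hypothetical callable standing in for the real RPC machinery, and the payload keys `id` and `expiration` are an assumption about the request format rather than something stated in this thread.

```python
import errno


def request_expiration_update(send_rpc, jobid, expiration):
    """Return True if the update may proceed, False if the scheduler rejected it.

    send_rpc is a hypothetical callable that raises OSError when the RPC fails.
    """
    try:
        send_rpc("sched.expiration", {"id": jobid, "expiration": expiration})
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            # The scheduler does not implement the RPC: assume the update is valid.
            return True
        # Any other failure is treated as a rejection by the scheduler.
        return False
    return True


def unimplemented_scheduler(topic, payload):
    # Models a scheduler without sched.expiration support.
    raise OSError(errno.ENOSYS, "Function not implemented")


def rejecting_scheduler(topic, payload):
    # Models a scheduler that implements the RPC and refuses the update.
    raise OSError(errno.EINVAL, "update rejected")


def accepting_scheduler(topic, payload):
    # Models a scheduler that accepts the update.
    return None


results = [
    request_expiration_update(unimplemented_scheduler, 1, 3600.0),
    request_expiration_update(rejecting_scheduler, 1, 3600.0),
    request_expiration_update(accepting_scheduler, 1, 3600.0),
]
print(results)
```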

@grondo
Contributor Author

grondo commented Nov 8, 2023

Update: flux-framework/flux-core#5522 was merged and released with flux-core v0.56.0. Until this issue is resolved, updates of running jobs will be allowed without notification to the Fluxion scheduler, which will have to adapt to the new time limits. In testing, this doesn't seem to be a critical issue, but it would probably be best if the sched.expiration RPC were supported by Fluxion eventually.

@grondo
Contributor Author

grondo commented Mar 1, 2024

@milroy: As requested in this week's meeting, some references for implementation of this feature:

@milroy
Member

milroy commented Mar 2, 2024

Thanks for consolidating the information @grondo!

@milroy milroy self-assigned this Mar 2, 2024
@milroy
Member

milroy commented Mar 10, 2024

RFC 27 states in Expiration:
"The request MAY fail, for example if: [...] The new expiration time would invalidate an advance reservation." Is an advance reservation a system reservation or a job reservation?

@grondo
Contributor Author

grondo commented Mar 10, 2024

The intent is to prevent an expiration extension from overlapping an administrative reservation (however we end up implementing that), not just a normal reservation that's part of the current schedule plan (if that's what is meant by a job reservation).

However, I do recall @ryanday36 mentioning that admins can just kill jobs running on an administrative reservation if necessary, so maybe we don't actually need to worry about this for now?

@milroy
Member

milroy commented Mar 23, 2024

I've implemented the basic expiration functionality in my fork, but am wondering how best to handle the RPC.

Using a simple relay like what's done with sched.resource-status is the most straightforward, but requires fairly extensive modification to the Fluxion planner. This is because the relay callback can be executed at any time in the scheduler loop, meaning that the allocated job that requires updated expiration can extend into one or more reservations (administrative or normal). The extension can render the reservations invalid, requiring them to be pushed back, which can create a cascade of reservation pushbacks. I've implemented much of the logic to handle the cascade and could complete the implementation with a bit more work.
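For readers unfamiliar with the cascade problem, a simplified illustration in Python follows. It assumes a single resource on which reservations run back to back, which is far simpler than the real Fluxion planner; `push_back` and the tuple layout are hypothetical and are not taken from the fork mentioned above.

```python
def push_back(reservations, busy_until):
    """Shift reservations later so that none starts before the resource is free.

    reservations: list of (jobid, start, duration), sorted by start time,
                  all competing for the same single resource (assumption).
    busy_until:   new end time of the running job whose expiration was extended.
    """
    result = []
    for jobid, start, duration in reservations:
        # A reservation moves only if the span before it now ends after its start.
        new_start = max(start, busy_until)
        result.append((jobid, new_start, duration))
        # The moved reservation may in turn collide with the next one.
        busy_until = new_start + duration
    return result


# The running job originally ended at t=100 and is extended to t=120.
planned = [(2, 100, 50), (3, 150, 50), (4, 400, 10)]
shifted = push_back(planned, busy_until=120)
print(shifted)
```

In this example job 2 moves from 100 to 120, which pushes job 3 from 150 to 170, while job 4 is far enough in the future to stay put. A real planner has to do the same across many resources and time spans, which is where the complexity comes from.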

The other route we discussed is to handle the RPC after the scheduler loop, which guarantees that all reservations will be cleared. I have a working implementation in the Fluxion planner, and it's much simpler than dealing with reservation conflicts. However, handling the RPC is clumsier. I think what's needed is to check for a sched.resource-status RPC in the post_sched_loop in the qmanager_cb_t class and relay the RPC to Fluxion. I don't think qmanager_cb_t was designed for sending RPCs, though.
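The second route can be pictured as queueing requests and applying them once the loop has finished. The Python sketch below is a toy model under the assumption that reservations are cleared at the end of every scheduler loop; `ToyQueueManager` and its methods are hypothetical and do not mirror the actual `qmanager_cb_t` class.

```python
class ToyQueueManager:
    """Hypothetical model of a queue manager; not Fluxion's qmanager_cb_t."""

    def __init__(self):
        self.end_times = {}     # jobid -> current expiration of running jobs
        self.reservations = {}  # jobid -> planned start, valid for one loop only
        self.pending = []       # expiration requests received but not yet applied

    def on_expiration_request(self, jobid, expiration):
        # Only record the request; the planner state is not touched here.
        self.pending.append((jobid, expiration))

    def sched_loop(self):
        # Placeholder for real scheduling work, which may create reservations.
        self.reservations[99] = 500.0
        return self.post_sched_loop()

    def post_sched_loop(self):
        # Assumption: reservations are cleared at the end of every loop,
        # so an extended job cannot conflict with one.
        self.reservations.clear()
        applied = []
        for jobid, expiration in self.pending:
            if jobid in self.end_times:
                self.end_times[jobid] = expiration
                applied.append(jobid)
        self.pending.clear()
        return applied


qm = ToyQueueManager()
qm.end_times[7] = 100.0
qm.on_expiration_request(7, 250.0)
before = qm.end_times[7]       # still the old value: the update is deferred
applied = qm.sched_loop()
print(before, qm.end_times[7], applied)
```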

@trws or @grondo do you have suggestions?

@grondo
Contributor Author

grondo commented Mar 25, 2024

"This is because the relay callback can be executed at any time in the scheduler loop, [...]"

I apologize, but I don't know much about the Fluxion planner. However, an RPC callback can't be invoked until you re-enter the Flux reactor, and I'd be a little surprised if this occurs in the middle of a scheduler loop. Feel free to correct me if I'm wrong.

Also, I just noticed that the RPC relay implementation in qmanager referenced above makes a blocking RPC get, so that is not a good example. Instead, flux_future_then(3) should be used to schedule handling the response and returning it to the original caller. I can make a PR for this, even though we're probably deprecating sched.resource-status anyway (flux-framework/flux-core#5796).
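As a rough analogy for the continuation style suggested here, the Python sketch below uses `concurrent.futures.Future.add_done_callback` in place of flux_future_then(3). It is not Flux code: `relay`, `upstream_call`, and `respond` are hypothetical names, and the real C API differs in its details.

```python
from concurrent.futures import Future


def relay(upstream_call, respond):
    """Forward a request and answer the caller from a continuation.

    upstream_call: hypothetical callable that starts the upstream RPC and
                   returns a Future immediately, without waiting.
    respond:       hypothetical callable that sends the reply to the caller.
    """
    fut = upstream_call()

    def continuation(done):
        # Runs only once the upstream response is available.
        respond(done.result())

    fut.add_done_callback(continuation)
    return fut


responses = []
upstream = Future()
relay(lambda: upstream, responses.append)
answered_early = list(responses)      # nothing has been sent to the caller yet
upstream.set_result({"status": "ok"})  # the upstream response arrives later
print(answered_early, responses)
```

The point of the pattern is that `relay` returns right away, so the process can keep servicing other messages while the upstream request is outstanding.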

@milroy milroy linked a pull request Mar 27, 2024 that will close this issue