Deferred allocation #987

Open
JaeseungYeom opened this issue Nov 4, 2022 · 15 comments

@JaeseungYeom

JaeseungYeom commented Nov 4, 2022

Problem: support a scheduling request for an allocation to occur at a specific time in the future.

Currently, a reservation of resources occurs as early as possible. However, to support workflows that benefit from running tasks across heterogeneous platforms, it is desirable to synchronize multiple allocations across different child instances, so that tasks 1-10 run on corona while tasks 11-20 "simultaneously" run on another cluster managed by Flux.
To support such use cases, two things are needed.
One is the deferred allocation capability, and the other is a means to query the allocation delay.
A parent instance can query its remote child instances to find the earliest time at which all the children can allocate the requested resources. Then, it should be possible to allocate synchronously across instances.

Pushing the reservation time back should also take backfilling into account.
To be clear, this is not the same as trying to allocate at the earliest opportunity after a specific point in time.
I am not entirely sure whether the existing issue #963 is the latter case or the same as this.
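
To make the distinction above concrete, here is a minimal sketch assuming a single resource whose busy periods are known as half-open intervals in seconds. The function names and the interval model are hypothetical illustrations, not part of Flux or flux-sched.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) busy span, in seconds


def is_free(busy: List[Interval], start: int, duration: int) -> bool:
    """True if [start, start + duration) overlaps no busy interval."""
    end = start + duration
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def allocate_at(busy: List[Interval], t: int, duration: int) -> Optional[int]:
    """Deferred allocation as requested here: start exactly at t, or fail."""
    return t if is_free(busy, t, duration) else None


def allocate_earliest_after(busy: List[Interval], t: int, duration: int) -> int:
    """The other semantics: the first feasible start time at or after t."""
    # A feasible start is either t itself or the end of some busy interval.
    candidates = sorted([t] + [b_end for _, b_end in busy if b_end > t])
    return next(c for c in candidates if is_free(busy, c, duration))
```

With busy spans `[(0, 100), (150, 300)]` and a 50-second request at `t = 120`, `allocate_at` fails because the request would collide with the second span, whereas `allocate_earliest_after` slides the start to 300. Only the first behavior keeps allocations on different instances aligned.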

@milroy
Member

milroy commented Feb 23, 2023

> A parent instance can query its remote child instances to find the earliest time at which all the children can allocate the requested resources. Then, it should be possible to allocate synchronously across instances.

To make sure I understand the basics (without getting into too much complexity yet) of the deferred allocation capability, this is a three-part process:

  1. Submit the jobspecs to all child instances with a new match_reserve (jobspec) request that reserves the requested resources on each child instance at the earliest time possible and returns those times.
  2. Find the latest time returned (T), and for all the child instances that returned earlier times, issue a new match_reserve_at (jobspec, T) which moves the reservation back to time T.
  3. Handle the case where one or more children can't satisfy match_reserve_at (jobspec, T).

Is that basically correct?
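
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `ChildInstance`, its `cancel` method, and `reserve_synchronously` are hypothetical names, `match_reserve` and `match_reserve_at` are the proposed (not yet existing) requests, and the failure policy in step 3 (release everything) is just the simplest choice.

```python
from typing import Dict, Optional, Protocol


class ChildInstance(Protocol):
    """Hypothetical interface to a child instance; not an existing Flux API."""

    def match_reserve(self, jobspec: dict) -> int: ...
    def match_reserve_at(self, jobspec: dict, t: int) -> bool: ...
    def cancel(self, jobspec: dict) -> None: ...


def reserve_synchronously(
    children: Dict[str, ChildInstance], jobspecs: Dict[str, dict]
) -> Optional[int]:
    """Return the common start time T, or None if the children cannot agree."""
    if not children:
        return None
    # Step 1: each child reserves at its own earliest time and reports it.
    earliest = {
        name: child.match_reserve(jobspecs[name])
        for name, child in children.items()
    }
    # Step 2: the latest of those times is the only one all children can meet.
    target = max(earliest.values())
    for name, child in children.items():
        if earliest[name] == target:
            continue  # already reserved at T
        if not child.match_reserve_at(jobspecs[name], target):
            # Step 3 (assumed policy): release every reservation and give up.
            for other_name, other in children.items():
                other.cancel(jobspecs[other_name])
            return None
    return target
```

A caller could retry with a later T after a `None` result; whether that is preferable to failing the whole request is a policy question this sketch leaves open.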

@milroy
Member

milroy commented Feb 23, 2023

I've confirmed that by manipulating the at time in dfu_traverser_t::run:

int dfu_traverser_t::run (Jobspec::Jobspec &jobspec,
we can achieve the desired behavior. Here I've simulated this by hardcoding at = 3600 in dfu_traverser_t::run and performing a match allocate:

resource-query> match allocate t/data/resource/jobspecs/basics/test001.yaml
      ---------------core35[1:x]
      ------------socket1[1:x]
      ---------node1[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=RESERVED
INFO: SCHEDULED AT=3600
INFO: =============================

Of course, there will be a decent amount of development required to add new match_op_t cases and determine the best way to include the desired time in the jobspec.

@milroy
Member

milroy commented Mar 12, 2023

> Of course, there will be a decent amount of development required to add new match_op_t cases

Actually that is not complicated.

> and determine the best way to include the desired time in the jobspec.

As discussed with @grondo during last week's team meeting, we still need to decide how to proceed with this part. The current state of PR #1013 uses the optional system key space to let users set the deferred time. To ensure the allocation doesn't get moved up (which is undesired) or moved back for each match allocate_orelse_reserve, I added code to use a base time (deferred_from in epoch seconds) which makes deferred_start a relative time:

jobspec.attributes.system.optional.find ("deferred_from");

An example test jobspec looks like this:

version: 9999
resources:
    - type: cluster
      count: 1
      with:
        - type: rack
          count: 1
          with:
            - type: node
              count: 1
              with:
                  - type: slot
                    count: 1
                    label: default
                    with:
                      - type: socket
                        count: 1
                        with:
                          - type: core
                            count: 1
# a comment
attributes:
  system:
    duration: 3600
    # optional deferred keys
    deferred_start: 1800
    deferred_from: 0
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1

My sense is that while this may work well for automated submission it will be hard for manual submission. @jameshcorbett and @ryanday36 might have good input here.

@vsoch
Member

vsoch commented Mar 23, 2023

The problem is that you need to be able to define those attributes without writing a yaml file every time?

We are working on a shape spec for resources (flux-framework/rfc#371); maybe we need the same for system attributes? Ping @trws

@garlick
Member

garlick commented Mar 23, 2023

Would the submit time (called t_submit in qmanager) work as the deferred_from value?

@grondo
Contributor

grondo commented Mar 23, 2023

> The problem is that you need to be able to define those attributes without writing a yaml file every time?

There is already a facility for specifying system attributes on the command line of the submission commands (see the documentation of --setattr in, e.g., flux-run(1)).

@grondo
Contributor

grondo commented Mar 23, 2023

> Would the submit time (called t_submit in qmanager) work as the deferred_from value?

That is a great idea. I was going to suggest something similar in that t_submit could be the default if deferred_from is not set (in case allowing a different deferred_from is useful in testing?)
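
As a rough illustration of that default, the sketch below resolves the absolute start time from the two keys discussed in this thread. It assumes the start is computed as `deferred_from + deferred_start`; the function name `resolve_deferred_at` and the flat mapping of system attributes are hypothetical and do not reflect the actual code in PR #1013.

```python
from typing import Mapping, Optional


def resolve_deferred_at(
    system_attrs: Mapping[str, float], t_submit: float
) -> Optional[float]:
    """Return the absolute start time in epoch seconds, or None if not deferred."""
    if "deferred_start" not in system_attrs:
        return None
    # Assumption: fall back to the submit time when no explicit base is given.
    base = system_attrs.get("deferred_from", t_submit)
    return base + system_attrs["deferred_start"]
```

Because both `t_submit` and an explicit `deferred_from` are fixed for the life of the job, repeated match attempts resolve to the same absolute time, so the reservation is neither moved up nor pushed back on each pass.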

@ryanday36

I think that t_submit probably makes sense for a default deferred_from value. I'm not quite clear: does the current implementation allow the user to set an absolute time, or just a relative time? It seems like the best interface for users would allow them to say something like --setattr=deferred_start=3pm or --setattr=deferred_start=+2.1h (i.e. take the same datetime formats as the current --begin-time flag).

I was also thinking more about what keyword would make sense for this. I'm leaning toward something more like 'reserve_time' or 'reserve_start', or maybe 'require_start' since it will raise an exception on the job if it can't start at that time.

@grondo
Contributor

grondo commented Mar 23, 2023

The --begin-time option uses a timestamp (absolute time) which is obtained by parsing the user's argument with our Python parse_datetime() function:

       --begin-time=DATETIME
              Convenience option for setting a begin-time dependency for a
              job. The job is guaranteed to start after the specified date
              and time. If DATETIME begins with a + character, then the
              remainder is considered to be an offset in Flux standard
              duration (RFC 23), otherwise, any datetime expression accepted
              by the Python parsedatetime module is accepted, e.g. 2021-06-21
              8am, in an hour, tomorrow morning, etc.

It would be nice to support something similar here.
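
For illustration, the offset-or-absolute rule described in that man page excerpt could look roughly like the sketch below. It is not Flux's parse_datetime(): the offset branch handles only a simplified subset of RFC 23 durations (a number with an optional ms, s, m, h, or d suffix), and the absolute branch swaps in the standard library's `datetime.fromisoformat` in place of parsedatetime, so free-form inputs such as "3pm" are not handled. The name `parse_start` and the UTC default for naive times are assumptions.

```python
import re
from datetime import datetime, timezone

# Simplified duration pattern: a number plus an optional unit suffix.
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h|d)?$")
_UNIT_SECONDS = {None: 1.0, "ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_start(spec: str, now: float) -> float:
    """Convert a user-supplied start spec to absolute epoch seconds."""
    if spec.startswith("+"):
        # Leading "+": the remainder is an offset relative to `now`.
        match = _DURATION.match(spec[1:])
        if match is None:
            raise ValueError(f"invalid duration: {spec!r}")
        return now + float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    # Otherwise treat the input as an absolute ISO 8601 timestamp.
    parsed = datetime.fromisoformat(spec)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.timestamp()
```

Resolving the user's input to an absolute timestamp once, at submission, would give the scheduler a single unambiguous value regardless of how the time was typed.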

If we can add whatever option we call this to the jobspec RFC, then perhaps it would make sense to expose this as a similar option in the submission commands?

Or, would it be too kludgy to add some kind of sentinel to --begin-time to make it set this option in jobspec instead of a dependency? (e.g. --begin-time=force:3pm) Meh, just throwing that out there. Simple enough and probably clearer to add a --require-start=3pm option. Still, if we are exposing an option in the core submission commands, we should have the resulting jobspec properties documented in the RFC.

@vsoch
Member

vsoch commented Mar 23, 2023

@grondo why should we require users to figure out timestamps / timezones? Isn't it easier (or minimally should be an option) to provide relative times? E.g., what if you are doing some kind of flux proxy to an instance in a different timezone and then you get it wrong (or minimally have to convert which is a hairball I don't think we want to dive into).

A suggestion - if begin time is already a thing (and indeed it's actually a time to begin) why not have a --start that provides the same but is relative? E.g., --start=60 (start in an hour) and then I don't have to think about actual times (thank goodness!)

Reference for time pain: https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b923ca ⏲️ 😱

@grondo
Contributor

grondo commented Mar 23, 2023

I'm confused. As shown above, the interface does not require users to actually specify the timestamp. The begin time can be specified as an offset or absolute time or any other format supported by parsedatetime.

@vsoch
Member

vsoch commented Mar 23, 2023

Oh I see, if you add + it is an offset? Sorry I'm just really stupid.

@vsoch
Member

vsoch commented Mar 23, 2023

I'll just see myself out, I'm not really helping anyone.

@garlick
Member

garlick commented Mar 23, 2023

I think I'm having one of those days myself FWIW.

@milroy
Member

milroy commented Mar 24, 2023

> I think that t_submit probably makes sense for a default deferred_from value. I'm not quite clear, does the current implementation allow the user to set an absolute time, or just a relative time?

I didn't know about t_submit and that does sound like the right default choice.

I just realized I obfuscated a crucial detail with deferred_from: 0 in my example jobspec above. That value is the epoch time in seconds. Here's how it's used in the PR currently: 90f8229.

I could certainly implement what @grondo suggested from the --begin-time option.

7 participants