Rebase rq2 and add multi match + find #4

zekemorton · 2023-03-24T00:41:48Z

This PR adds incremental progress towards migrating resource query to using the reapi. In this PR, I added functionality in both rq2 and the reapi.

In the reapi, I added functionality to use the traverser to search for resources based on specified criteria. I opted not to implement multiple-match because it seemed simple enough to loop through matches from the rq2 side of things. Is there any reason why we would need to implement this as a part of the reapi?

For rq2, I added functionality for multi match and for find.

For multi match, I create a general purpose execute_allocate function that can be called once for match and in a for loop for multi match. After completing this, It dawned on me that perhaps we do not need both a match and multi match? We could potentially have a single match that can handle both situations.
For find, I implemented the correspond functionality in the reapi and called it from rq2, adapting a lot of what was already in the old resource query.

I have done some basic manual testing, but am still a bit unsure unsure about the shareness tests. Perhaps I will create a simple test case and see if a can get it to pass a part of they PR

Problem: t1006-qmanager-multiqueue.t exploits the fact that jobs may be submitted with incorrect queue specifications when the frobnicator is disabled, but this method won't work once the job manager becomes queue-aware. Submit test jobs with qmanager unloaded, then reconfigure queues and reload qmanager to trigger the job exceptions.

testsuite: fix coverage method for queue exception

Problem: the current system defaults specify a short SYSTEM_MAX_DURATION of 7 days. Propagating the value to the resource graph will result in unexpected behavior and unschedulable jobs as the current time approaches 7 days from the startup. The `uint64_t` types differ from the int64_t used by std::chrono. Update SYSTEM_MAX_DURATION to be a reasonably large time in the future and change the types to comply with std::chrono.

Problem: updated RFC 14 specifies that if the jobspec does not indicate a duration then the acquired resource expiration should be propagated to the job time limit. Currently there is no way to track the current resource expiration in the resource graph. Add a struct based on `std::chrono::time_point`s for the graph start and end times with default values for both. Default values are defined in `system_defaults.hpp`. Add a setter to set the times at resource acquisition.

Problem: updated RFC 14 requires that upon resource acquisition schedulers propagate the expiration set in `R` into `Rv1` fragments allocated to jobs when jobspec duration is not set. Add unpacking of expiration during resource acquisition. Check for invalid and inexpressable values of start and end times and then convert valid times `std::chrono::time_point`s. Set the resource graph duration after successful resource acquisition.

Problem: updated RFC 14 specifies that if a jobspec duration is not set the resource graph expiration should be propagated to the job's time limit. Add validity checks to ensure that if the jobspec duration is longer than the graph duration the job is not scheduled. Add check to ensure that the duration (uint64_t) is less than the expressable int64_t max () value. Check if the scheduled `at` time is negative (invalid) or greater than or equal to the resource graph end time (invalid). However, if the job start time plus the duration is greater than the graph end time we shorten the duration to fit within the remaining time. We know this is a valid schedule since the scheduling traversal has returned successfully for the full duration.

Problem: there are no tests checking whether jobs with duration=0 have their duration set according to RFC14. Add checks for duration inheritance when duration=0 to the test suite.

Problem: there are no checks to ensure that jobspecs with invalid durations are rejected. Add checks for negative durations and overly long durations.

Problem: there are no checks for jobs that request a longer duration than the lifetime of the resource graph. Add checks.

Resource graph duration and job expiration to conform to RFC 14

Problem: fluxion queues are currently configured in the qmanager in the sched-fluxion-qmanager table, but a framework-wide TOML configuration was proposed in RFC 33. Implement RFC33 queues in the qmanager by transforming the RFC 33 configuration into the old sched-fluxion-qmanager syntax. Raise an error if queues are configured in the sched-fluxion table instead of in the RFC33-compliant way. Fixes flux-framework#950.

…-config qmanager: support RFC33 TOML queue config

Problem: There are no release notes for flux-sched v0.25.0 Add notes for this release.

NEWS: add release notes for 0.25.0

Problem: In the current codebase, we mostly use the term reapi, except for the name of the directory hlapi under resource. This naming scheme should be consistent. Rename resource/hlapi to resource/reapi and update relevant makefiles and headers.

reapi: Rename resource/hlapi to resource/reapi

Problem: one of the yaml examples in resource/utilities/README is invalid - it is missing a with: header for a section (and does not load). This fixes it.

…-resource-yaml fix invalid yaml in resource utilities README

Problem: The flux.util.OutputFormat class in flux-core will get an improved constructor where `headings` becomes an optional keyword argument, but this will break the PerfOutputFormat class in t/scripts/flux-tree-helper.py Try the new-style constructor arguments in PerfOutputFormat. If that fails, fall back to the old style so that either works for the flux-tree-helper.py test script.

testsuite: update flux-tree-helper.py for new OutputFormat constructor

Problem: Using ubuntu-latest in the python-lint and python-format builders causes the check to fail with an error message saying that Python 3.6 cannot be found. This PR just switches the version of Ubuntu to match flux-core's Ubuntu version in their python-lint check: 20.04.

github: change ubuntu version for python ci

Problem: flux-core PR 4776 added the ability for queues to be individually started and stopped. As an update, starting/stopping all queues with the `flux queue` command now requires the --all option to be specified, otherwise a warning is output. To avoid the warning, update all callers to specify --all when multiple queues are started or stopped.

testsuite: start/stop all queues with --all option

Problem: In several tests, it is assumed that newly created named queues are "started" by default. This is not a safe assumption going forward. Solution: In tests that create named queues, start the queues immediately after they are configured.

testsuite: do not assume queues started by default

Problem: The version of Fluxion is not reported in dmesg or elsewhere for debugging and error reporting purposes. Log the version before starting main() in both the qmanager and resource modules.

report Fluxion version when broker modules are loaded

Problem: t1008-recovery-none.t expects the job manager to abort the scheduler if a job fails to re-allocate resources during the hello handshake, but this behavior will change soon. Drop this test. The behavior it is looking for will either be addressed by a true fix to flux-framework#991 or the workaround proposed in flux-framework/flux-core#4894.

Problem: t1007-recovery-full.t cancels each test job then runs the cleanup_active_jobs helper function, which does the same thing. Drop the explicit cleanup from the test.

Problem: The specific errors returned from feasibility or alloc request for jobs with jobspec parse errors, including errors job constraints, are lost before the error is returned in the RPC to the user or job validator. Therefore the user only gets a generic error isntead of a specific, more helpful error. Pass the exception error message back up through run_match() so it is available in the error response. Fixes flux-framework#1002

Problem: The jobspec validator may prevent invalid constraints in jobspec from reaching the scheduler, thus thwarting some testing. Reload the job-ingest module without the validator so that the scheduler has to process invalid constraints. Later, load just the feasibility plugin to test the scheduler's feasibility operation. Reorder and fix some tests affected by the loaded validator plugin changes.

resource: improve error messages for jobspec parse errors

Problem: The hostlist constraint match() method overwrites errno with ENOENT when the host is not found in the hostlist. This can change the errno returned by the run() function of the traverser, leading to incorrect error messages from the resource module. Preserve errno in HoslistConstraint::match().

Problem: It is essential that contraint match() methods do not overwrite errno, since the traverser depends on setting and preserving errno to ENODEV for unsaitisfiable jobspecs. Ensure errno is preserved across a call to constraint->match() in the constraints unit tests.

Problem: The testsuite does not verify that constraints with unknown hosts or ranks return an unsatisfiable error instead of internal match error. Add a couple tests that ensure hostlist and rank constraints with an unknown host/rank return unsatisfiable and not "Internal match error."

resource: fix 'Internal match error' when hostlist constraint is provided

Problem: The NEWS.md file needs to be updated for v0.26.0. Add notes for this release to NEWS.md.

NEWS.md: add release notes for flux-sched v0.26.0

Problem: `perf` profiling together with a test replacement of `plus` with an in-place addition reveals that the `plus` function is not inlined by the compiler. Add inline keyword to coerce the compiler to inline the function and others in the namespace. The change improves match performance across a run of the issue reproducer by a factor of >100 for several policies.

Problem: the current design of the multilevel ID match policy makes use of a vector of factors (`m_multilevel_factors`) to keep track of all the added factors and accumulates them to generate a prefix for the overall vertex score. The O(n) operation is repeated for each vertex. Furthermore, the vector is not cleared between match traversals, so the operation becomes increasingly expensive. Replace the vector with a single `int64_t` which contains the sum of all vertex factors. Instead of popping the current vertex's factor from a vector, simply subtract its factor stored in `score_factor_t`. The change accelerates exclusive matching by a factor of about 6.

Improve Fluxion match performance

Problem: we should be able to build the container on request without needing to commit. Solution: add a workflow_dispatch event to the main workflow. Signed-off-by: vsoch <[email protected]>

…-dispatch add workflow dispatch

Fix build for GCC 13

garlick and others added 30 commits September 30, 2022 09:35

Merge pull request flux-framework#978 from garlick/queue_test_fix

409de4e

testsuite: fix coverage method for queue exception

testsuite: check for duration inheritance

e828b14

Problem: there are no tests checking whether jobs with duration=0 have their duration set according to RFC14. Add checks for duration inheritance when duration=0 to the test suite.

testsuite: checks for invalid jobspec duration

23ad79e

Problem: there are no checks to ensure that jobspecs with invalid durations are rejected. Add checks for negative durations and overly long durations.

testsuite: add checks for job durations

28706af

Problem: there are no checks for jobs that request a longer duration than the lifetime of the resource graph. Add checks.

Merge pull request flux-framework#969 from milroy/graph-duration

69f99fb

Resource graph duration and job expiration to conform to RFC 14

Merge pull request flux-framework#980 from jameshcorbett/rfc-33-queue…

bcaa7d6

…-config qmanager: support RFC33 TOML queue config

NEWS: add release notes for 0.25.0

ed9942d

Problem: There are no release notes for flux-sched v0.25.0 Add notes for this release.

Merge pull request flux-framework#981 from garlick/rel25

52ec95a

NEWS: add release notes for 0.25.0

Merge pull request flux-framework#983 from tpatki/rename_hlapi_to_reapi

8caa0c6

reapi: Rename resource/hlapi to resource/reapi

doc: fix invalid yaml in resource utilities README

0d01542

Problem: one of the yaml examples in resource/utilities/README is invalid - it is missing a with: header for a section (and does not load). This fixes it.

Merge pull request flux-framework#982 from flux-framework/fix/invalid…

01ba6d7

…-resource-yaml fix invalid yaml in resource utilities README

Merge pull request flux-framework#985 from grondo/outputfmt-update

b084b9b

testsuite: update flux-tree-helper.py for new OutputFormat constructor

Merge pull request flux-framework#993 from chu11/python_lint

dd3a26d

github: change ubuntu version for python ci

Merge pull request flux-framework#992 from chu11/flux_core_pr4776

93ebe75

testsuite: start/stop all queues with --all option

Merge pull request flux-framework#996 from chu11/flux_core_pr4857

ecf20b3

testsuite: do not assume queues started by default

modules: report Fluxion version on module load

d43ab1b

Problem: The version of Fluxion is not reported in dmesg or elsewhere for debugging and error reporting purposes. Log the version before starting main() in both the qmanager and resource modules.

Merge pull request flux-framework#998 from grondo/report-version

b7a6d35

report Fluxion version when broker modules are loaded

testsuite: drop unnecessary cleanup

c93fa44

Problem: t1007-recovery-full.t cancels each test job then runs the cleanup_active_jobs helper function, which does the same thing. Drop the explicit cleanup from the test.

grondo and others added 29 commits February 6, 2023 09:46

Merge pull request flux-framework#1003 from grondo/issue#1002

3b33a36

resource: improve error messages for jobspec parse errors

Merge pull request flux-framework#1005 from grondo/issue#1004

b92989a

resource: fix 'Internal match error' when hostlist constraint is provided

NEWS.md: add release notes for flux-sched v0.26.0

d9dd4b5

Problem: The NEWS.md file needs to be updated for v0.26.0. Add notes for this release to NEWS.md.

Merge pull request flux-framework#1006 from grondo/news-0.26

f7bdcb5

NEWS.md: add release notes for flux-sched v0.26.0

Merge pull request flux-framework#1007 from milroy/perf-fix

35d3c96

Improve Fluxion match performance

add workflow dispatch

b6f9982

Problem: we should be able to build the container on request without needing to commit. Solution: add a workflow_dispatch event to the main workflow. Signed-off-by: vsoch <[email protected]>

Merge pull request flux-framework#1012 from researchapps/add/workflow…

035e83e

…-dispatch add workflow dispatch

resource: add missing <stdint.h> header

fde7785

resource: fix typos appending to error message

4e42c73

Merge pull request flux-framework#1011 from jwakely/gcc-13-fixes

a9c1bd5

Fix build for GCC 13

rq2: update autotools makefile to make rq2

10db895

rq2: adding skeleton initialization of resource-query replacement

fedf2ec

add working match function

fe7f238

rq2: add info (not yet working)

7d3be3f

util: added arg parsing to rq2

2410e0a

implemented rgraph read from load file

c3def7e

rq2: added control loop

6b0c8e1

rq2: print rgraph after match

9696715

implemented cancel

9e3c1d2

remove redundant function

4ebbd3f

added multi-mactch

e106777

rq2: add find

6bcc93f

zekemorton changed the base branch from master to rq-api-client March 24, 2023 00:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rebase rq2 and add multi match + find #4

Rebase rq2 and add multi match + find #4

zekemorton commented Mar 24, 2023

Rebase rq2 and add multi match + find #4

Are you sure you want to change the base?

Rebase rq2 and add multi match + find #4

Conversation

zekemorton commented Mar 24, 2023