Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rebase rq2 and add multi match + find #4

Open
wants to merge 74 commits into
base: rq-api-client
Choose a base branch
from

Conversation

zekemorton
Copy link

This PR adds incremental progress towards migrating resource query to using the reapi. In this PR, I added functionality in both rq2 and the reapi.

In the reapi, I added functionality to use the traverser to search for resources based on specified criteria. I opted not to implement multiple-match because it seemed simple enough to loop through matches from the rq2 side of things. Is there any reason why we would need to implement this as a part of the reapi?

For rq2, I added functionality for multi match and for find.

  • For multi match, I create a general purpose execute_allocate function that can be called once for match and in a for loop for multi match. After completing this, It dawned on me that perhaps we do not need both a match and multi match? We could potentially have a single match that can handle both situations.
  • For find, I implemented the correspond functionality in the reapi and called it from rq2, adapting a lot of what was already in the old resource query.

I have done some basic manual testing, but am still a bit unsure unsure about the shareness tests. Perhaps I will create a simple test case and see if a can get it to pass a part of they PR

garlick and others added 30 commits September 30, 2022 09:35
Problem: t1006-qmanager-multiqueue.t exploits the fact that jobs
may be submitted with incorrect queue specifications when the
frobnicator is disabled, but this method won't work once the job
manager becomes queue-aware.

Submit test jobs with qmanager unloaded, then reconfigure queues
and reload qmanager to trigger the job exceptions.
testsuite: fix coverage method for queue exception
Problem: the current system defaults specify a
short SYSTEM_MAX_DURATION of 7 days. Propagating
the value to the resource graph will result in
unexpected behavior and unschedulable jobs as
the current time approaches 7 days from the
startup. The `uint64_t` types differ from
the int64_t used by std::chrono.

Update SYSTEM_MAX_DURATION to be a reasonably
large time in the future and change the
types to comply with std::chrono.
Problem: updated RFC 14 specifies that if the jobspec does
not indicate a duration then the acquired resource expiration
should be propagated to the job time limit. Currently there
is no way to track the current resource expiration in the
resource graph.

Add a struct based on `std::chrono::time_point`s for the
graph start and end times with default values for both.
Default values are defined in `system_defaults.hpp`.
Add a setter to set the times at resource acquisition.
Problem: updated RFC 14 requires that upon resource acquisition
schedulers propagate the expiration set in `R` into `Rv1`
fragments allocated to jobs when jobspec duration is not set.

Add unpacking of expiration during resource acquisition.
Check for invalid and inexpressable values of start and
end times and then convert valid times `std::chrono::time_point`s.
Set the resource graph duration after successful resource
acquisition.
Problem: updated RFC 14 specifies that if a jobspec duration is not set
the resource graph expiration should be propagated to the job's
time limit.

Add validity checks to ensure that if the jobspec duration is longer
than the graph duration the job is not scheduled. Add check to
ensure that the duration (uint64_t) is less than the expressable
int64_t max () value. Check if the scheduled `at` time is negative
(invalid) or greater than or equal to the resource graph end time
(invalid). However, if the job start time plus the duration is
greater than the graph end time we shorten the duration to
fit within the remaining time. We know this is a valid schedule
since the scheduling traversal has returned successfully for the
full duration.
Problem: there are no tests checking whether jobs with
duration=0 have their duration set according to RFC14.

Add checks for duration inheritance when duration=0
to the test suite.
Problem: there are no checks to ensure that jobspecs
with invalid durations are rejected.

Add checks for negative durations and overly long
durations.
Problem: there are no checks for jobs that request a
longer duration than the lifetime of the resource graph.

Add checks.
Resource graph duration and job expiration to conform to RFC 14
Problem: fluxion queues are currently configured in the qmanager
in the sched-fluxion-qmanager table, but a framework-wide
TOML configuration was proposed in RFC 33.

Implement RFC33 queues in the qmanager by transforming the
RFC 33 configuration into the old sched-fluxion-qmanager syntax.

Raise an error if queues are configured in the sched-fluxion
table instead of in the RFC33-compliant way.

Fixes flux-framework#950.
…-config

qmanager: support RFC33 TOML queue config
Problem: There are no release notes for flux-sched v0.25.0

Add notes for this release.
NEWS: add release notes for 0.25.0
Problem: In the current codebase, we mostly use the term reapi,
except for the name of the directory hlapi under resource. This
naming scheme should be consistent.

Rename resource/hlapi to resource/reapi and update relevant
makefiles and headers.
reapi: Rename resource/hlapi to resource/reapi
Problem: one of the yaml examples in resource/utilities/README
is invalid - it is missing a with: header for a section (and does not
load). This fixes it.
…-resource-yaml

fix invalid yaml in resource utilities README
Problem: The flux.util.OutputFormat class in flux-core will get an
improved constructor where `headings` becomes an optional keyword
argument, but this will break the PerfOutputFormat class in
t/scripts/flux-tree-helper.py

Try the new-style constructor arguments in PerfOutputFormat. If that
fails, fall back to the old style so that either works for the
flux-tree-helper.py test script.
testsuite: update flux-tree-helper.py for new OutputFormat constructor
Problem: Using ubuntu-latest in the python-lint and python-format builders
causes the check to fail with an error message saying that Python 3.6
cannot be found.

This PR just switches the version of Ubuntu to match flux-core's Ubuntu
version in their python-lint check: 20.04.
github: change ubuntu version for python ci
Problem: flux-core PR 4776 added the ability for queues to be
individually started and stopped.  As an update, starting/stopping
all queues with the `flux queue` command now requires the --all option
to be specified, otherwise a warning is output.

To avoid the warning, update all callers to specify --all when multiple
queues are started or stopped.
testsuite: start/stop all queues with --all option
Problem: In several tests, it is assumed that newly created named
queues are "started" by default.  This is not a safe assumption
going forward.

Solution: In tests that create named queues, start the queues immediately
after they are configured.
testsuite: do not assume queues started by default
Problem: The version of Fluxion is not reported in dmesg or elsewhere
for debugging and error reporting purposes.

Log the version before starting main() in both the qmanager and
resource modules.
report Fluxion version when broker modules are loaded
Problem: t1008-recovery-none.t expects the job manager to abort the
scheduler if a job fails to re-allocate resources during the hello
handshake, but this behavior will change soon.

Drop this test.  The behavior it is looking for will either be
addressed by a true fix to flux-framework#991 or the workaround proposed in
flux-framework/flux-core#4894.
Problem: t1007-recovery-full.t cancels each test job then runs
the cleanup_active_jobs helper function, which does the same thing.

Drop the explicit cleanup from the test.
grondo and others added 29 commits February 6, 2023 09:46
Problem: The specific errors returned from feasibility or alloc request
for jobs with jobspec parse errors, including errors job constraints,
are lost before the error is returned in the RPC to the user or job
validator. Therefore the user only gets a generic error isntead of
a specific, more helpful error.

Pass the exception error message back up through run_match() so it
is available in the error response.

Fixes flux-framework#1002
Problem: The jobspec validator may prevent invalid constraints in
jobspec from reaching the scheduler, thus thwarting some testing.

Reload the job-ingest module without the validator so that the
scheduler has to process invalid constraints.

Later, load just the feasibility plugin to test the scheduler's
feasibility operation.

Reorder and fix some tests affected by the loaded validator plugin
changes.
resource: improve error messages for jobspec parse errors
Problem: The hostlist constraint match() method overwrites errno
with ENOENT when the host is not found in the hostlist. This can
change the errno returned by the run() function of the traverser,
leading to incorrect error messages from the resource module.

Preserve errno in HoslistConstraint::match().
Problem: It is essential that contraint match() methods do not
overwrite errno, since the traverser depends on setting and preserving
errno to ENODEV for unsaitisfiable jobspecs.

Ensure errno is preserved across a call to constraint->match() in
the constraints unit tests.
Problem: The testsuite does not verify that constraints with unknown
hosts or ranks return an unsatisfiable error instead of internal
match error.

Add a couple tests that ensure hostlist and rank constraints with an
unknown host/rank return unsatisfiable and not "Internal match error."
resource: fix 'Internal match error' when hostlist constraint is provided
Problem: The NEWS.md file needs to be updated for v0.26.0.

Add notes for this release to NEWS.md.
NEWS.md: add release notes for flux-sched v0.26.0
Problem: `perf` profiling together with a test replacement of `plus`
with an in-place addition reveals that the `plus` function is not
inlined by the compiler.

Add inline keyword to coerce the compiler to inline the function and
others in the namespace. The change improves match performance across a
run of the issue reproducer by a factor of >100 for several policies.
Problem: the current design of the multilevel ID match policy makes use
of a vector of factors (`m_multilevel_factors`) to keep track of all
the added factors and accumulates them to generate a prefix for the
overall vertex score. The O(n) operation is repeated for each vertex.
Furthermore, the vector is not cleared between match traversals, so the
operation becomes increasingly expensive.

Replace the vector with a single `int64_t` which contains the sum of all
vertex factors. Instead of popping the current vertex's factor from a
vector, simply subtract its factor stored in `score_factor_t`. The
change accelerates exclusive matching by a factor of about 6.
Problem: we should be able to build the container on request without
needing to commit.
Solution: add a workflow_dispatch event to the main workflow.

Signed-off-by: vsoch <[email protected]>
@zekemorton zekemorton changed the base branch from master to rq-api-client March 24, 2023 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants