forked from flux-framework/flux-sched
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rebase rq2 and add multi match + find #4
Open
zekemorton
wants to merge
74
commits into
milroy:rq-api-client
Choose a base branch
from
zekemorton:rebase-rq2
base: rq-api-client
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Problem: t1006-qmanager-multiqueue.t exploits the fact that jobs may be submitted with incorrect queue specifications when the frobnicator is disabled, but this method won't work once the job manager becomes queue-aware. Submit test jobs with qmanager unloaded, then reconfigure queues and reload qmanager to trigger the job exceptions.
testsuite: fix coverage method for queue exception
Problem: the current system defaults specify a short SYSTEM_MAX_DURATION of 7 days. Propagating the value to the resource graph will result in unexpected behavior and unschedulable jobs as the current time approaches 7 days from the startup. The `uint64_t` types differ from the int64_t used by std::chrono. Update SYSTEM_MAX_DURATION to be a reasonably large time in the future and change the types to comply with std::chrono.
Problem: updated RFC 14 specifies that if the jobspec does not indicate a duration then the acquired resource expiration should be propagated to the job time limit. Currently there is no way to track the current resource expiration in the resource graph. Add a struct based on `std::chrono::time_point`s for the graph start and end times with default values for both. Default values are defined in `system_defaults.hpp`. Add a setter to set the times at resource acquisition.
Problem: updated RFC 14 requires that upon resource acquisition schedulers propagate the expiration set in `R` into `Rv1` fragments allocated to jobs when jobspec duration is not set. Add unpacking of expiration during resource acquisition. Check for invalid and inexpressable values of start and end times and then convert valid times `std::chrono::time_point`s. Set the resource graph duration after successful resource acquisition.
Problem: updated RFC 14 specifies that if a jobspec duration is not set the resource graph expiration should be propagated to the job's time limit. Add validity checks to ensure that if the jobspec duration is longer than the graph duration the job is not scheduled. Add check to ensure that the duration (uint64_t) is less than the expressable int64_t max () value. Check if the scheduled `at` time is negative (invalid) or greater than or equal to the resource graph end time (invalid). However, if the job start time plus the duration is greater than the graph end time we shorten the duration to fit within the remaining time. We know this is a valid schedule since the scheduling traversal has returned successfully for the full duration.
Problem: there are no tests checking whether jobs with duration=0 have their duration set according to RFC14. Add checks for duration inheritance when duration=0 to the test suite.
Problem: there are no checks to ensure that jobspecs with invalid durations are rejected. Add checks for negative durations and overly long durations.
Problem: there are no checks for jobs that request a longer duration than the lifetime of the resource graph. Add checks.
Resource graph duration and job expiration to conform to RFC 14
Problem: fluxion queues are currently configured in the qmanager in the sched-fluxion-qmanager table, but a framework-wide TOML configuration was proposed in RFC 33. Implement RFC33 queues in the qmanager by transforming the RFC 33 configuration into the old sched-fluxion-qmanager syntax. Raise an error if queues are configured in the sched-fluxion table instead of in the RFC33-compliant way. Fixes flux-framework#950.
…-config qmanager: support RFC33 TOML queue config
Problem: There are no release notes for flux-sched v0.25.0 Add notes for this release.
NEWS: add release notes for 0.25.0
Problem: In the current codebase, we mostly use the term reapi, except for the name of the directory hlapi under resource. This naming scheme should be consistent. Rename resource/hlapi to resource/reapi and update relevant makefiles and headers.
reapi: Rename resource/hlapi to resource/reapi
Problem: one of the yaml examples in resource/utilities/README is invalid - it is missing a with: header for a section (and does not load). This fixes it.
…-resource-yaml fix invalid yaml in resource utilities README
Problem: The flux.util.OutputFormat class in flux-core will get an improved constructor where `headings` becomes an optional keyword argument, but this will break the PerfOutputFormat class in t/scripts/flux-tree-helper.py Try the new-style constructor arguments in PerfOutputFormat. If that fails, fall back to the old style so that either works for the flux-tree-helper.py test script.
testsuite: update flux-tree-helper.py for new OutputFormat constructor
Problem: Using ubuntu-latest in the python-lint and python-format builders causes the check to fail with an error message saying that Python 3.6 cannot be found. This PR just switches the version of Ubuntu to match flux-core's Ubuntu version in their python-lint check: 20.04.
github: change ubuntu version for python ci
Problem: flux-core PR 4776 added the ability for queues to be individually started and stopped. As an update, starting/stopping all queues with the `flux queue` command now requires the --all option to be specified, otherwise a warning is output. To avoid the warning, update all callers to specify --all when multiple queues are started or stopped.
testsuite: start/stop all queues with --all option
Problem: In several tests, it is assumed that newly created named queues are "started" by default. This is not a safe assumption going forward. Solution: In tests that create named queues, start the queues immediately after they are configured.
testsuite: do not assume queues started by default
Problem: The version of Fluxion is not reported in dmesg or elsewhere for debugging and error reporting purposes. Log the version before starting main() in both the qmanager and resource modules.
report Fluxion version when broker modules are loaded
Problem: t1008-recovery-none.t expects the job manager to abort the scheduler if a job fails to re-allocate resources during the hello handshake, but this behavior will change soon. Drop this test. The behavior it is looking for will either be addressed by a true fix to flux-framework#991 or the workaround proposed in flux-framework/flux-core#4894.
Problem: t1007-recovery-full.t cancels each test job then runs the cleanup_active_jobs helper function, which does the same thing. Drop the explicit cleanup from the test.
Problem: The specific errors returned from feasibility or alloc request for jobs with jobspec parse errors, including errors job constraints, are lost before the error is returned in the RPC to the user or job validator. Therefore the user only gets a generic error isntead of a specific, more helpful error. Pass the exception error message back up through run_match() so it is available in the error response. Fixes flux-framework#1002
Problem: The jobspec validator may prevent invalid constraints in jobspec from reaching the scheduler, thus thwarting some testing. Reload the job-ingest module without the validator so that the scheduler has to process invalid constraints. Later, load just the feasibility plugin to test the scheduler's feasibility operation. Reorder and fix some tests affected by the loaded validator plugin changes.
resource: improve error messages for jobspec parse errors
Problem: The hostlist constraint match() method overwrites errno with ENOENT when the host is not found in the hostlist. This can change the errno returned by the run() function of the traverser, leading to incorrect error messages from the resource module. Preserve errno in HoslistConstraint::match().
Problem: It is essential that contraint match() methods do not overwrite errno, since the traverser depends on setting and preserving errno to ENODEV for unsaitisfiable jobspecs. Ensure errno is preserved across a call to constraint->match() in the constraints unit tests.
Problem: The testsuite does not verify that constraints with unknown hosts or ranks return an unsatisfiable error instead of internal match error. Add a couple tests that ensure hostlist and rank constraints with an unknown host/rank return unsatisfiable and not "Internal match error."
resource: fix 'Internal match error' when hostlist constraint is provided
Problem: The NEWS.md file needs to be updated for v0.26.0. Add notes for this release to NEWS.md.
NEWS.md: add release notes for flux-sched v0.26.0
Problem: `perf` profiling together with a test replacement of `plus` with an in-place addition reveals that the `plus` function is not inlined by the compiler. Add inline keyword to coerce the compiler to inline the function and others in the namespace. The change improves match performance across a run of the issue reproducer by a factor of >100 for several policies.
Problem: the current design of the multilevel ID match policy makes use of a vector of factors (`m_multilevel_factors`) to keep track of all the added factors and accumulates them to generate a prefix for the overall vertex score. The O(n) operation is repeated for each vertex. Furthermore, the vector is not cleared between match traversals, so the operation becomes increasingly expensive. Replace the vector with a single `int64_t` which contains the sum of all vertex factors. Instead of popping the current vertex's factor from a vector, simply subtract its factor stored in `score_factor_t`. The change accelerates exclusive matching by a factor of about 6.
Improve Fluxion match performance
Problem: we should be able to build the container on request without needing to commit. Solution: add a workflow_dispatch event to the main workflow. Signed-off-by: vsoch <[email protected]>
…-dispatch add workflow dispatch
Fix build for GCC 13
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds incremental progress towards migrating resource query to using the reapi. In this PR, I added functionality in both rq2 and the reapi.
In the reapi, I added functionality to use the traverser to search for resources based on specified criteria. I opted not to implement multiple-match because it seemed simple enough to loop through matches from the rq2 side of things. Is there any reason why we would need to implement this as a part of the reapi?
For rq2, I added functionality for multi match and for find.
match
that can handle both situations.I have done some basic manual testing, but am still a bit unsure unsure about the shareness tests. Perhaps I will create a simple test case and see if a can get it to pass a part of they PR