Provide parallel_one_host_only via workers config file #5695

Merged
merged 4 commits into from
Jul 3, 2024

@b10n1k b10n1k commented Jun 13, 2024

github-actions bot commented Jun 13, 2024

Great PR! Please pay attention to the following items before merging:

Files matching docs/*.asciidoc:

  • Consider generating documentation locally to verify it is rendered correctly using tools/generate-docs

This is an automatically generated QA checklist based on modified files.

@b10n1k b10n1k force-pushed the schedulig_mm_158146 branch 2 times, most recently from ab04ef5 to c8f2488 Compare June 13, 2024 13:50
@b10n1k b10n1k force-pushed the schedulig_mm_158146 branch from c8f2488 to b396b5b Compare June 17, 2024 18:24
@b10n1k b10n1k force-pushed the schedulig_mm_158146 branch 2 times, most recently from add06ea to dc4b30a Compare June 18, 2024 13:47
codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.41%. Comparing base (aa7ac4b) to head (dc4b30a).
Report is 2 commits behind head on master.

Current head dc4b30a differs from pull request most recent head 0e31752

Please upload reports for the commit 0e31752 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5695      +/-   ##
==========================================
- Coverage   98.42%   98.41%   -0.01%     
==========================================
  Files         394      394              
  Lines       38516    38447      -69     
==========================================
- Hits        37908    37837      -71     
- Misses        608      610       +2     


@b10n1k b10n1k force-pushed the schedulig_mm_158146 branch 2 times, most recently from d4c2491 to 18a6e84 Compare June 24, 2024 15:05
b10n1k commented Jun 24, 2024

All unnecessary changes have been reverted. I couldn't extend the tests any further, but the existing coverage should be sufficient for now.

@b10n1k b10n1k marked this pull request as ready for review June 24, 2024 15:09

@Martchus Martchus left a comment

Looks good but I'd move out the two documentation changes just for now.

@b10n1k b10n1k force-pushed the schedulig_mm_158146 branch 2 times, most recently from 2d60cc4 to 522dfec Compare June 27, 2024 14:05

@Martchus Martchus left a comment

I still think this lacks some additions to the scheduler logic. I'll come up with a test case next week to show that.

@Martchus Martchus force-pushed the schedulig_mm_158146 branch 2 times, most recently from af76645 to e3bcfb3 Compare July 1, 2024 16:38

Martchus commented Jul 1, 2024

I pushed a few more commits to add tests for the scheduler logic and extend it. It was not completely trivial because the evaluation of the setting now needs to be done in a much more dynamic way. I haven't done any manual tests yet.

@Martchus Martchus left a comment

With my changes I find it good enough to be merged. I am keeping the "not-ready" label until we/I have done some manual testing.

Martchus commented Jul 2, 2024

I have now tested this manually with a cluster of 2 jobs and all workers having PARALLEL_ONE_HOST_ONLY=1, as we'd use it in production. None of the jobs had the setting.

  • With all workers on different hosts, nothing is scheduled, as expected (we need at least 2 workers with a common host name).
  • As soon as I replaced one worker so that there were two workers with a common host, jobs were scheduled on those two slots, as expected.
  • When I replaced one more worker so that there were two hosts with two slots each in the pool, I was able to occupy all worker slots, and clusters were still not scheduled across multiple hosts.

So I guess that's good enough.
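
The three scenarios above can be reproduced with a small toy model. This is only an illustrative sketch in Python, not the openQA scheduler: it assumes every free slot carries PARALLEL_ONE_HOST_ONLY=1, models a slot as a hypothetical `(host, instance)` tuple, and ignores worker classes entirely. The function names are made up for this example.

```python
from collections import defaultdict


def pick_slots_on_one_host(free_slots, cluster_size):
    """Return `cluster_size` free slots that share a host, or None."""
    by_host = defaultdict(list)
    for host, instance in free_slots:
        by_host[host].append((host, instance))
    # Sorted only to make the toy model deterministic.
    for host in sorted(by_host):
        if len(by_host[host]) >= cluster_size:
            return by_host[host][:cluster_size]
    return None


def schedule_clusters(free_slots, cluster_size):
    """Assign clusters one by one until no host has enough free slots left."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    remaining = list(free_slots)
    assignments = []
    while (picked := pick_slots_on_one_host(remaining, cluster_size)) is not None:
        assignments.append(picked)
        remaining = [slot for slot in remaining if slot not in picked]
    return assignments, remaining


# Scenario 1: four slots on four different hosts, nothing can be scheduled.
different_hosts = [("host-a", 1), ("host-b", 1), ("host-c", 1), ("host-d", 1)]
# Scenario 2: two slots share a host, so one cluster fits on it.
one_shared_host = [("host-a", 1), ("host-a", 2), ("host-c", 1), ("host-d", 1)]
# Scenario 3: two hosts with two slots each, so two clusters fill all slots.
two_full_hosts = [("host-a", 1), ("host-a", 2), ("host-b", 1), ("host-b", 2)]

for scenario in (different_hosts, one_shared_host, two_full_hosts):
    print(schedule_clusters(scenario, cluster_size=2))
```

In this model a cluster is either placed entirely on one host or not placed at all, which matches the observation that clusters were never split across hosts.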

@Martchus Martchus force-pushed the schedulig_mm_158146 branch from e3bcfb3 to 2a78ce5 Compare July 2, 2024 08:20
okurz

This comment was marked as resolved.

b10n1k and others added 4 commits July 2, 2024 11:04
* Add PARALLEL_ONE_HOST_ONLY to the worker properties like other
  "capabilities"
* Remove absent capabilities from worker properties to make it
  possible to unconfigure the PARALLEL_ONE_HOST_ONLY setting again

Related ticket: https://progress.opensuse.org/issues/158146

Co-authored-by: Marius Kittler <[email protected]>
Signed-off-by: ybonatakis <[email protected]>
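
The second bullet of the commit message above (removing absent capabilities so the setting can be unconfigured) can be illustrated with a short sketch. It is written in Python for brevity and is not the openQA code, which is Perl. The `CAPABILITY_KEYS` subset, the `sync_capabilities` name and the plain-dict storage are all assumptions made for this example.

```python
# Hypothetical subset of capability keys; the real list lives in openQA.
CAPABILITY_KEYS = ("cpu_arch", "mem_max", "PARALLEL_ONE_HOST_ONLY")


def sync_capabilities(stored, reported):
    """Return stored worker properties updated from the reported capabilities.

    Capabilities the worker reports are written; capabilities it no longer
    reports are dropped, so removing PARALLEL_ONE_HOST_ONLY from the worker
    config also removes the stored property. Other properties are untouched.
    """
    result = dict(stored)
    for key in CAPABILITY_KEYS:
        if key in reported:
            result[key] = reported[key]
        else:
            result.pop(key, None)
    return result


stored = {"cpu_arch": "x86_64", "PARALLEL_ONE_HOST_ONLY": "1", "OTHER_PROPERTY": "kept"}
reported = {"cpu_arch": "x86_64", "mem_max": "4096"}
print(sync_capabilities(stored, reported))
```

Without the removal step, a property that was set once would stay on the worker even after the setting is deleted from its configuration.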
* Consider the property "PARALLEL_ONE_HOST_ONLY" when assigning a worker
  slot that has this worker property set to a truthy value
* Fall back to normal scheduling (excluding all worker slots with the
  "PARALLEL_ONE_HOST_ONLY" property set to a truthy value) so the presence
  of "PARALLEL_ONE_HOST_ONLY" worker slots will hopefully never make things
  worse (despite the fact that `WorkerSlotPicker` is not a full SAT solver
  and therefore limited)
* Consider the property "PARALLEL_ONE_HOST_ONLY" when picking siblings of
  a running job to repair half-scheduled job clusters

Related ticket: https://progress.opensuse.org/issues/158146
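
The fallback described in this commit message can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the actual `WorkerSlotPicker` logic: slots are hypothetical dicts, worker classes are ignored, the order of the two attempts is a guess, and `is_truthy` only approximates Perl truthiness.

```python
from collections import defaultdict


def is_truthy(value):
    # Rough stand-in for Perl truthiness; an assumption of this sketch.
    return value not in (None, "", "0", 0, False)


def pick_on_one_host(slots, size):
    """Return `size` slots sharing a host, or None if no host has enough."""
    by_host = defaultdict(list)
    for slot in slots:
        by_host[slot["host"]].append(slot)
    for host in sorted(by_host):
        if len(by_host[host]) >= size:
            return by_host[host][:size]
    return None


def assign_cluster(free_slots, size):
    """Try a single-host assignment first, then fall back to normal scheduling.

    The fallback ignores hosts but excludes every slot that has
    PARALLEL_ONE_HOST_ONLY set, so such a slot is never used in a
    cluster that spans several hosts.
    """
    picked = pick_on_one_host(free_slots, size)
    if picked is not None:
        return picked
    unrestricted = [
        slot for slot in free_slots
        if not is_truthy(slot.get("PARALLEL_ONE_HOST_ONLY"))
    ]
    return unrestricted[:size] if len(unrestricted) >= size else None


slots = [
    {"host": "host-a", "instance": 1, "PARALLEL_ONE_HOST_ONLY": "1"},
    {"host": "host-b", "instance": 1, "PARALLEL_ONE_HOST_ONLY": "1"},
    {"host": "host-c", "instance": 1},
    {"host": "host-d", "instance": 1},
]
print(assign_cluster(slots, 2))
```

With the fallback in place, restricted slots on lone hosts simply stay idle while the unrestricted slots can still run the cluster across hosts.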
@Martchus Martchus force-pushed the schedulig_mm_158146 branch from 2a78ce5 to 0e31752 Compare July 2, 2024 09:05
@Martchus Martchus removed the not-ready label Jul 2, 2024

Martchus commented Jul 2, 2024

Really annoying that Codecov uploads occasionally fail:

debug - 2024-07-02 09:24:29,772 -- Upload result --- {"result": "RequestResult(error=RequestError(code='HTTP Error 401', params={}, description='{\"detail\":\"Invalid token header. No credentials provided.\"}'), warnings=[], status_code=401, text='{\"detail\":\"Invalid token header. No credentials provided.\"}')"}
error - 2024-07-02 09:24:29,772 -- Upload failed: {"detail":"Invalid token header. No credentials provided."}

Since this isn't treated as an error (the CircleCI codecov step passes) I'll have to re-run all CircleCI tests.

Martchus commented Jul 3, 2024

I retried yesterday and also today and Codecov is reliably broken. That's also the case on other PRs like #5723.

Strangely, OBS doesn't have the same problem. I checked the upload steps of a few PRs on https://github.com/openSUSE/open-build-service/pull. Maybe we should compare our setup with the one from OBS.

For the time being I checked https://output.circle-artifacts.com/output/job/5a4e33c3-6409-401d-b492-b79ba22e4c05/artifacts/0/cover_db/coverage.html manually and it looks good. So I'm merging this PR now.

@Martchus Martchus merged commit cb1357e into os-autoinst:master Jul 3, 2024
39 checks passed
3 participants