
Do not force fail ceph-mgr after reconfiguring OSDs #770

Merged

Conversation


@fmount fmount commented Dec 18, 2024

@fmount fmount force-pushed the ceph-mgr-issue branch 2 times, most recently from 9d112bc to 03ebc59 on December 20, 2024 at 09:53
@fmount fmount force-pushed the ceph-mgr-issue branch 2 times, most recently from f35d37f to eeae679 on December 20, 2024 at 20:28

fmount commented Jan 2, 2025

> Seen a failure once with this patch: https://logserver.rdoproject.org/41/55541/11/check/periodic-adoption-multinode-to-crc-ceph-1/3071151/controller/data-plane-adoption-tests-repo/data-plane-adoption/tests/logs/test_ceph_migration_out_2024-12-24T02:52:12EST.log

Ack, thank you for the help with testing. I think there's another point in the code that I need to update with the same include_tasks (which is exactly where the failure you hit stopped the execution). I'm updating this patch and will start another round of rechecks.
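For context, a minimal sketch of the include_tasks reuse pattern mentioned above; the task file name is hypothetical and the real call sites in the patch may differ:

```yaml
# Illustrative only: run the same ceph-mgr guard from every spot that
# previously forced a mgr failover. "ceph_mgr_check.yml" is a hypothetical
# file name, not necessarily the one used in this patch.
- name: Check the ceph-mgr state before reconfiguring the OSDs
  ansible.builtin.include_tasks: ceph_mgr_check.yml

# ... OSD reconfiguration tasks ...

- name: Check the ceph-mgr state again after reconfiguring the OSDs
  ansible.builtin.include_tasks: ceph_mgr_check.yml
```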


fmount commented Jan 7, 2025

Looks like the new update is working in the latest rechecks.
I'll run a set of new tests to catch any potential failure that would invalidate the current work.

  1. Ceph migration: passed
  2. Ceph migration: passed
  3. Ceph migration: passed
  4. Ceph migration: passed
  5. Ceph migration: passed
  6. Ceph migration: passed
  7. Ceph migration: passed
  8. Ceph migration: passed
  9. Ceph migration: passed - ceph-mgr was stuck and the check introduced by this patch has been executed.
  10. Ceph migration: passed - ceph-mgr was stuck and the code of this patch has been executed.
  11. Ceph migration: passed
  12. Ceph migration: passed
  13. Ceph migration: passed
  14. Ceph migration: passed
  15. Ceph migration: passed
  16. Ceph migration: passed
  17. Ceph migration: not executed - Unrelated failure in the data-plane-adoption step


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f86ad3901edf4c4cba86c3eae942359c

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 01m 27s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 48m 07s


fmount commented Jan 10, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8a49b2e09ae9418d8ba653041ed0dffe

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 59m 22s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 47m 26s


fmount commented Jan 12, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4e87dd22c89843f5bebc74ec3a915d62

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 58m 07s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 46m 41s


fmount commented Jan 13, 2025

/retest


fmount commented Jan 13, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7f98c552c6e3435d878c506ae938362a

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 01m 05s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 23m 22s


fmount commented Jan 14, 2025

recheck

@fmount fmount requested review from jistr and frenzyfriday January 14, 2025 13:06

fmount commented Jan 14, 2025

@jistr @frenzyfriday as per the (multiple) tests we ran against this patch, I think it's ready to merge. We observed (in the last ~20 runs) that the fail_mgr set of tasks lets us avoid the TIMEOUT issue caused by the ceph mgr fail command.
If you're OK with that, can you help land this patch so we can start checking the periodic executions and (hopefully) resolve the associated CIX?
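For reference, a minimal sketch of the kind of guard described above, assuming a cephadm-managed cluster; the fail_mgr flag, the 30-second bound, and the exact commands are illustrative assumptions rather than the actual tasks in this patch:

```yaml
# Illustrative sketch: only fail over the mgr when one is actually active,
# and never let a stuck `ceph mgr fail` block the play indefinitely.
- name: Get the current mgr state
  ansible.builtin.command: cephadm shell -- ceph mgr stat -f json
  register: mgr_stat
  changed_when: false

- name: Fail over the active mgr, bounded so a stuck mgr cannot hang the play
  ansible.builtin.command: cephadm shell -- timeout 30 ceph mgr fail
  when:
    # "fail_mgr" is an assumed flag name for the fail_mgr task set
    - fail_mgr | default(true) | bool
    - (mgr_stat.stdout | from_json).active_name | default('') | length > 0
  register: mgr_fail_result
  failed_when: false

- name: Report a mgr that did not respond, instead of aborting the migration
  ansible.builtin.debug:
    msg: "ceph mgr fail did not complete; continuing without forcing a failover"
  when:
    - mgr_fail_result is not skipped
    - mgr_fail_result.rc != 0
```

The idea of the bound plus failed_when: false is that a stuck mgr gets reported and skipped rather than holding the play until the job-level TIMEOUT.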

@frenzyfriday
Contributor

/lgtm

@ciecierski
Contributor

/lgtm
/approve


openshift-ci bot commented Jan 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ciecierski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ciecierski
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 15, 2025
@karelyatin
Contributor

/lgtm
Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?


fmount commented Jan 15, 2025

> /lgtm Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?

I'm not entirely sure, but I suspect it's still the same bug we saw in TripleO in different circumstances. It might depend on the Ceph node size (mostly in terms of memory, if they differ), but I can't really confirm.

@karelyatin
Contributor

> /lgtm Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?
>
> I'm not entirely sure, but I suspect it's still the same bug we saw in TripleO in different circumstances. It might depend on the Ceph node size (mostly in terms of memory, if they differ), but I can't really confirm.

OK, yes, that could be it. Downstream I see larger flavors being used: 12 vCPU / 24 GB memory vs. 8 vCPU / 16 GB memory.


fmount commented Jan 20, 2025

@jistr @frenzyfriday can we land this patch?

@openshift-merge-bot openshift-merge-bot bot merged commit 1f02da4 into openstack-k8s-operators:main Jan 20, 2025
5 checks passed