
Do not force fail ceph-mgr after reconfiguring OSDs #770

Merged

Conversation


@fmount fmount commented Dec 18, 2024

@fmount fmount force-pushed the ceph-mgr-issue branch 2 times, most recently from 9d112bc to 03ebc59 on December 20, 2024 at 09:53
@fmount fmount force-pushed the ceph-mgr-issue branch 2 times, most recently from f35d37f to eeae679 on December 20, 2024 at 20:28

fmount commented Jan 2, 2025

> Seen a failure once with this patch: https://logserver.rdoproject.org/41/55541/11/check/periodic-adoption-multinode-to-crc-ceph-1/3071151/controller/data-plane-adoption-tests-repo/data-plane-adoption/tests/logs/test_ceph_migration_out_2024-12-24T02:52:12EST.log

Ack, thank you for the help with testing. I think there's another point in the code that I need to update with the same include_tasks (which is exactly where the failure you hit stopped the execution). I'm updating this patch and will start another round of rechecks.
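For context, a minimal sketch of the include_tasks reuse pattern mentioned above; the task file name is hypothetical and the real call sites in the patch may differ:

```yaml
# Illustrative only: run the same ceph-mgr guard from every spot that
# previously forced a mgr failover. "ceph_mgr_check.yml" is a hypothetical
# file name, not necessarily the one used in this patch.
- name: Check the ceph-mgr state before reconfiguring the OSDs
  ansible.builtin.include_tasks: ceph_mgr_check.yml

# ... OSD reconfiguration tasks ...

- name: Check the ceph-mgr state again after reconfiguring the OSDs
  ansible.builtin.include_tasks: ceph_mgr_check.yml
```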


fmount commented Jan 7, 2025

Looks like the new update is working in the latest rechecks.
I'll run a set of new tests to catch any potential failure that would invalidate the current work.

  1. Ceph migration: passed
  2. Ceph migration: passed
  3. Ceph migration: passed
  4. Ceph migration: passed
  5. Ceph migration: passed
  6. Ceph migration: passed
  7. Ceph migration: passed
  8. Ceph migration: passed
  9. Ceph migration: passed - ceph-mgr was stuck and the check introduced by this patch has been executed.
  10. Ceph migration: passed - ceph-mgr was stuck and the code of this patch has been executed.
  11. Ceph migration: passed
  12. Ceph migration: passed
  13. Ceph migration: passed
  14. Ceph migration: passed
  15. Ceph migration: passed
  16. Ceph migration: passed
  17. Ceph migration: not executed - Unrelated failure in the data-plane-adoption step


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f86ad3901edf4c4cba86c3eae942359c

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 01m 27s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 48m 07s


fmount commented Jan 10, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8a49b2e09ae9418d8ba653041ed0dffe

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 59m 22s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 47m 26s


fmount commented Jan 12, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4e87dd22c89843f5bebc74ec3a915d62

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 58m 07s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 46m 41s


fmount commented Jan 13, 2025

/retest


fmount commented Jan 13, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7f98c552c6e3435d878c506ae938362a

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 01m 05s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 23m 22s


fmount commented Jan 14, 2025

recheck

@fmount fmount requested review from jistr and frenzyfriday January 14, 2025 13:06

fmount commented Jan 14, 2025

@jistr @frenzyfriday as per the (multiple) tests we ran against this patch, I think it's ready to merge. We observed (in the last ~20 runs) that the fail_mgr set of tasks lets us avoid the TIMEOUT issue caused by the ceph mgr fail command.
If you're OK with that, can you help land this patch so we can start checking the periodic executions and (hopefully) resolve the associated CIX?
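For reference, a minimal sketch of the kind of guard described above, assuming a cephadm-managed cluster; the fail_mgr flag, the 30-second bound, and the exact commands are illustrative assumptions rather than the actual tasks in this patch:

```yaml
# Illustrative sketch: only fail over the mgr when one is actually active,
# and never let a stuck `ceph mgr fail` block the play indefinitely.
- name: Get the current mgr state
  ansible.builtin.command: cephadm shell -- ceph mgr stat -f json
  register: mgr_stat
  changed_when: false

- name: Fail over the active mgr, bounded so a stuck mgr cannot hang the play
  ansible.builtin.command: cephadm shell -- timeout 30 ceph mgr fail
  when:
    # "fail_mgr" is an assumed flag name for the fail_mgr task set
    - fail_mgr | default(true) | bool
    - (mgr_stat.stdout | from_json).active_name | default('') | length > 0
  register: mgr_fail_result
  failed_when: false

- name: Report a mgr that did not respond, instead of aborting the migration
  ansible.builtin.debug:
    msg: "ceph mgr fail did not complete; continuing without forcing a failover"
  when:
    - mgr_fail_result is not skipped
    - mgr_fail_result.rc != 0
```

The idea of the bound plus failed_when: false is that a stuck mgr gets reported and skipped rather than holding the play until the job-level TIMEOUT.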

@frenzyfriday
Contributor

/lgtm

@ciecierski
Contributor

/lgtm
/approve


openshift-ci bot commented Jan 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ciecierski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ciecierski
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 15, 2025
@karelyatin
Contributor

/lgtm
Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?


fmount commented Jan 15, 2025

> /lgtm Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?

I'm not entirely sure, but I suspect it's still the same bug we saw in TripleO in different circumstances. It might depend on the Ceph node size (mostly in terms of memory, if they differ), but I can't really confirm.

@karelyatin
Contributor

> /lgtm Just unclear why we didn't see the same issue with similar downstream jobs; maybe you know?
>
> I'm not entirely sure, but I suspect it's still the same bug we saw in TripleO in different circumstances. It might depend on the Ceph node size (mostly in terms of memory, if they differ), but I can't really confirm.

OK, yes, that could be it. Downstream I see larger flavors being used: 12 vCPU / 24 GB memory vs. 8 vCPU / 16 GB memory.


fmount commented Jan 20, 2025

@jistr @frenzyfriday can we land this patch?

@openshift-merge-bot openshift-merge-bot bot merged commit 1f02da4 into openstack-k8s-operators:main Jan 20, 2025
5 checks passed