
Improvements for MG collections #11161

Conversation

petr-balogh
Member

@petr-balogh petr-balogh commented Jan 17, 2025

Fixes: #10526
Fixes: #11159

Several improvements to MG log collection, such as preventing MG from
running over and over when it keeps failing or timing out.

Collect OCP logs for Ecosystem tests (e.g. upgrade) decorated with the
purple squad marker.

Do not collect logs again at the end of a successful execution when they
were already collected at least once during the run by a failed test.
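The last point can be sketched as a minimal guard. This is a hypothetical simplification: the hook names, the flag, and `collect_must_gather` are illustrative stand-ins, only the skip-at-teardown behaviour comes from the description above.

```python
# Hypothetical sketch of "don't collect again at teardown if a failed
# test already collected logs during the run".
collected_runs = []  # stands in for the real MG collection side effect
logs_collected_during_run = False

def collect_must_gather():
    # Placeholder for the real (expensive) must-gather collection.
    collected_runs.append("mg")

def on_test_failure():
    # Failure hook: collect logs and remember that we did.
    global logs_collected_during_run
    collect_must_gather()
    logs_collected_during_run = True

def on_session_end(success):
    # At teardown, skip collection when the run succeeded overall but a
    # failed test already gathered logs at least once.
    if success and logs_collected_during_run:
        return
    collect_must_gather()
```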

@petr-balogh petr-balogh requested a review from a team as a code owner January 17, 2025 16:57
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Jan 17, 2025
@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch 3 times, most recently from c368dd8 to eabf343 on January 17, 2025 17:39
@petr-balogh
Member Author

Trying to verify here:
https://url.corp.redhat.com/1f1ea06

@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch from eabf343 to 752cfd4 on January 17, 2025 20:18
@petr-balogh
Member Author

New verification triggered here:
https://url.corp.redhat.com/108c27f

@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch 2 times, most recently from 0befd70 to bfe7cf7 on January 20, 2025 16:56
@petr-balogh
Member Author

Verification job:
https://url.corp.redhat.com/95eab59

@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch 2 times, most recently from 2b07848 to a272f62 on January 21, 2025 14:35
dahorak
dahorak previously approved these changes Jan 21, 2025
Contributor

@dahorak dahorak left a comment


LGTM

@petr-balogh
Member Author

Verification job:
https://url.corp.redhat.com/d37eb8b

"mcg",
"purple_squad",
}
# For every failure in MG we are trying to extend next attempt by 20 minutes
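The 20-minute extension per failure can be sketched as a small helper. This is a hypothetical simplification: only the "+20 minutes per prior failure" rule comes from the change; the base value and the helper name are assumptions.

```python
# Sketch of the escalating must-gather timeout (hypothetical helper).
BASE_TIMEOUT = 20 * 60  # first attempt: 20 minutes, in seconds (assumed base)
STEP = 20 * 60          # each failure extends the next attempt by 20 minutes

def mg_timeout(mg_fail_count):
    """Timeout for the next MG attempt, given the number of prior failures."""
    return BASE_TIMEOUT + mg_fail_count * STEP
```

With three attempts this yields 20, 40, and 60 minutes, i.e. up to 2 hours total.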
Contributor

This would be up to a 2-hour wait (20 min + 40 min + 60 min) when the default max_mg_fail_attempts is used.
Isn't that too much time?

Member Author

We sometimes see that even 60 minutes is really not enough, so I am giving it a chance with more time to collect logs, so we can analyze the MG output if we get anything. If it fails 3 times it will not do anything more, which is still better than before: in one example run it spent more than 24 hours collecting logs that always timed out, and we did not get any logs at all. I was thinking the increased time might help us get some logs from MG to identify why it is taking longer than it used to. We can reduce the failed-MG limit to 2 only, if needed.

Contributor

As long as we don't try to collect again in any later test case, except for once before teardown, this should be fine.

Member Author

If we reach our max attempts we do not collect again and skip the whole collection.

max_mg_fail_attempts = config.REPORTING.get("max_mg_fail_attempts")
if skip_after_max_fail:
    with mg_lock:
        if mg_fail_count > max_mg_fail_attempts:
Contributor

are we deleting the MG dir structure in case of timeout failure?

Member Author

No, it still contains some useful data produced by MG even if it times out.

Contributor

Should we maybe delete the directory structure of a failed MG collection when a later attempt succeeds?

Member Author

There is a directory for every failed test case. We do not retry a single failed collection; we just continue to the next test case, and the next time another failure occurs we try to collect MG for that test failure. So a new directory is created for MG, and the old one still holds some valuable information. Also, only one of the MG collections can fail (OCP or ODF), and we collect Noobaa logs to the same directory as well, so there is really nothing to delete: all collected data are valuable if we have any.

@petr-balogh
Member Author

New verification:
https://url.corp.redhat.com/0f2103c

@petr-balogh
Member Author

Verification job:
https://url.corp.redhat.com/0a180da

@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch from ad1d736 to cd5d9ca on January 23, 2025 22:27
@petr-balogh petr-balogh requested a review from a team as a code owner January 23, 2025 22:27
@petr-balogh petr-balogh force-pushed the improvmentes_for_mg_collections branch 2 times, most recently from 20c0450 to c71a8a8 on January 31, 2025 16:52
@petr-balogh
Member Author

Verification job: https://url.corp.redhat.com/05642f2

ebenahar
ebenahar previously approved these changes Feb 3, 2025
@openshift-ci openshift-ci bot added the lgtm label Feb 3, 2025
Fixes: red-hat-storage#10526
Fixes: red-hat-storage#11159

Several improvements to MG log collection, such as preventing MG from
running over and over when it keeps failing or timing out.

Collect OCP logs for Ecosystem tests (e.g. upgrade) decorated with the
purple squad marker.

Do not collect logs again at the end of a successful execution when they
were already collected at least once during the run by a failed test.

Signed-off-by: Petr Balogh <[email protected]>

openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ebenahar, OdedViner, petr-balogh

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-balogh petr-balogh merged commit f02c6e4 into red-hat-storage:master Feb 3, 2025
6 of 7 checks passed
Labels
lgtm size/L PR that changes 100-499 lines
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MG makes a lot of noise written to info log level
OCP must gather is not collected
4 participants