Improvements for MG collections #11161
Conversation
Force-pushed from c368dd8 to eabf343
Trying to verify here:
Force-pushed from eabf343 to 752cfd4
New verification triggered here:
Force-pushed from 0befd70 to bfe7cf7
Verification job:
Force-pushed from 2b07848 to a272f62
LGTM
Verification job:
Force-pushed from a272f62 to 8d56cdd
"mcg", | ||
"purple_squad", | ||
} | ||
# For every failure in MG we are trying to extend next attempt by 20 minutes |
This would be a wait of up to 2 hours (20 min + 40 min + 60 min) in case the default max_mg_fail_attempts is being used.
Isn't that too much time?
We sometimes see that even 60 minutes is really not enough, so I am giving it a chance with more time to collect logs, to let us analyze and check MG if we get anything. If it fails 3 times it will not try anything more, which is still better than before: in one example run it spent more than 24 hours on collecting logs which always timed out, and we didn't get any log. So I was thinking that this increased time might help us get at least some logs from MG to identify why it's taking longer than it used to. We can reduce the allowed number of MG failures to 2 only.
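For reference, a minimal sketch of the escalating timeout being discussed, assuming the 20-minute step and a default of 3 attempts; the function name and structure here are hypothetical, not the actual ocs-ci implementation:

```python
# Hypothetical sketch of the escalating must-gather timeout discussed above.
BASE_MG_TIMEOUT = 20 * 60  # 20 minutes, in seconds


def mg_timeout_for_attempt(mg_fail_count, max_mg_fail_attempts=3):
    """
    Return the timeout (in seconds) for the next MG collection attempt.

    Each previous failure extends the timeout by another 20 minutes, so
    attempts run with 20, 40 and 60 minute limits (up to ~2 hours of total
    waiting) before collection is skipped entirely.
    """
    if mg_fail_count >= max_mg_fail_attempts:
        raise RuntimeError("Max MG failure count reached, skipping collection")
    return BASE_MG_TIMEOUT * (mg_fail_count + 1)
```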
As long as we don't try to collect again in any other later test case, except for once before teardown, this should be fine.
If we reach our max attempts we do not collect it again and skip the whole collection.
```python
max_mg_fail_attempts = config.REPORTING.get("max_mg_fail_attempts")
if skip_after_max_fail:
    with mg_lock:
        if mg_fail_count > max_mg_fail_attempts:
```
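For illustration, a self-contained sketch of the guard in the snippet above, reusing the mg_lock / mg_fail_count / skip_after_max_fail names from the diff; the logging, the default of 3 attempts, and the run_must_gather helper are assumptions:

```python
import logging
import threading

log = logging.getLogger(__name__)

# Shared state guarded by a lock, mirroring mg_lock / mg_fail_count above.
mg_lock = threading.Lock()
mg_fail_count = 0


def collect_mg(config, skip_after_max_fail=True):
    """Skip MG collection once the configured failure budget is exhausted."""
    global mg_fail_count
    max_mg_fail_attempts = config.REPORTING.get("max_mg_fail_attempts", 3)
    if skip_after_max_fail:
        with mg_lock:
            if mg_fail_count > max_mg_fail_attempts:
                log.warning(
                    "Skipping MG collection: %d failures exceed the limit of %d",
                    mg_fail_count,
                    max_mg_fail_attempts,
                )
                return
    try:
        run_must_gather()  # hypothetical helper running the actual collection
    except Exception:
        with mg_lock:
            mg_fail_count += 1
        raise
```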
Are we deleting the MG dir structure in case of a timeout failure?
No, it still contains some useful data produced by MG even if it timed out.
Should we maybe delete the dir structure of a failed MG collection in case MG collection succeeds in a later attempt?
There is a directory for every failed test case. We are not doing any retry for a single failed collection; we just continue to the next test case, and the next time another failure occurs we try to collect MG for that test failure. So a new directory is created for MG, and the old one still holds some valuable information. Also, only one of the MG collections can fail (OCP or ODF), and we also collect Noobaa logs to the same directory, so there is really nothing to delete, as all collected data is valuable if we have any.
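A hypothetical sketch of the per-failure directory layout described above; the naming scheme is an assumption, the point being that every failed test case gets its own MG directory so partial data from a timed-out collection is never overwritten:

```python
import os
import time


def mg_dir_for_failed_test(log_dir, test_name):
    """
    Create a dedicated must-gather directory for a single failed test case.

    Partial data from a failed or timed-out collection (OCP MG, ODF MG,
    Noobaa logs) stays in its own directory and is never reused or
    overwritten by a collection triggered for a later test failure.
    """
    dir_name = f"failed_testcase_ocs_logs_{test_name}_{int(time.time())}"
    path = os.path.join(log_dir, dir_name)
    os.makedirs(path, exist_ok=True)
    return path
```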
New verification:
Verification job:
Force-pushed from ad1d736 to cd5d9ca
Force-pushed from 20c0450 to c71a8a8
Verification job: https://url.corp.redhat.com/05642f2
Fixes: red-hat-storage#10526
Fixes: red-hat-storage#11159

Several improvements in MG logs, like preventing running MG over and over when it keeps failing or timing out.
Collecting OCP logs for Ecosystem tests, like upgrade, decorated with purple squad.
Not collecting logs again at the end of execution on success when they were already collected at least once during execution by some failed test.

Signed-off-by: Petr Balogh <[email protected]>
Force-pushed from c71a8a8 to a29670c
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ebenahar, OdedViner, petr-balogh
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Fixes: #10526
Fixes: #11159
Several improvements in MG logs, like preventing running MG over and over when it keeps failing or timing out.
Collecting OCP logs for Ecosystem tests, like upgrade, decorated with purple squad.
Not collecting logs again at the end of execution on success when they were already collected at least once during execution by some failed test.
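A minimal sketch of the last point, assuming a simple module-level flag; the flag name and hook functions are hypothetical:

```python
# Hypothetical flag remembering that MG logs were already collected by a
# failed test earlier in the run.
mg_collected_during_run = False


def collect_logs_for_failed_test(test_name):
    """Collect MG logs for a failed test and remember that we have them."""
    global mg_collected_during_run
    # ... run must-gather for the failed test case here ...
    mg_collected_during_run = True


def collect_logs_at_session_end():
    """At the end of the run, skip collection if logs were already gathered."""
    if mg_collected_during_run:
        return  # some failed test already triggered a collection
    # ... otherwise run the final must-gather collection ...
```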