robustness: make qps below threshold retriable test. #17725
Conversation
Signed-off-by: Siyuan Zhang <[email protected]>
I like the idea - nice work @siyuanfoundation
Edit: On marginal hardware the retries might just shift the problem to a timeout on completing the robustness run within 30mins/200mins, but I still think it's better than the current situation, where a single performance flake throws off the entire robustness run.
cc @serathius @ahrtr
I don't like the idea. If we can get the required performance on GitHub Actions, we should be able to get it on Prow. If you look at the recent failures in GitHub Actions, they are not directly related to low qps; it's just a consequence of another issue.
This is a separate, unrelated problem. Robustness test reports are always written to the same directory, which depends on the name of the test. This means we cannot have multiple failures in a run, as a later failure would override the previous result. I'm open to changing it if you think this is a problem, but I don't think stopping on the first failure is bad, at least as long as we keep the number of failures low; there is no sense in having a robustness test with a high failure rate.
It's not necessarily that we can't get the qps; later iterations on Prow clearly do, but for some reason the first one doesn't. Putting that aside, I still think this feature makes sense. Why wouldn't we make the test suite more resilient to these annoying performance flakes? This seems like a very sensible feature.
Just like we don't slap retries on every other test, we need to understand the underlying problem here. I would need to understand what the problem with Prow is first.
We don't slap retries on every test because most of the tests are written to detect real problems when they fail. Here is my line of thought: if 1% of test runs do not satisfy the qps minimum, is that really a problem? On the other hand, if we say that 10% of runs missing the qps minimum is a real problem, 3 retries put the single-run success rate at (1-0.1^3)=0.999, and over 100 runs the probability of >=1 test failing is 1-0.999^100 ≈ 10%. So in the end, you would still see the test failing about 10% of the time.
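A quick, self-contained Go sketch (not part of the PR) to check the arithmetic above, with the 10% flake rate, 3 retries, and 100 runs taken as assumptions:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	flakeRate := 0.10 // assumed chance that a single attempt misses the qps minimum
	retries := 3.0    // attempts per test run
	runs := 100.0     // independent robustness runs

	// A run only fails if every attempt flakes: 1 - 0.1^3 = 0.999.
	runSuccess := 1 - math.Pow(flakeRate, retries)

	// Chance that at least one of the 100 runs still fails: 1 - 0.999^100 ≈ 10%.
	atLeastOneFailure := 1 - math.Pow(runSuccess, runs)

	fmt.Printf("per-run success rate: %.4f\n", runSuccess)                    // 0.9990
	fmt.Printf("P(>=1 failure in 100 runs): %.1f%%\n", atLeastOneFailure*100) // ~9.5%
}
```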
@@ -124,7 +153,7 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
 	maxRevisionChan := make(chan int64, 1)
 	g.Go(func() error {
 		defer close(maxRevisionChan)
-		operationReport = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
+		operationReport, retriableErr = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
nit: I think considering that this is the only place we are setting `retriableErr`, this should be fine, but in the future, if we set it in any of the other goroutines, that might be a race. How about returning the err instead of `nil` in this goroutine and catching it as part of `g.Wait()` below?
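Not the actual PR code, but a minimal, self-contained Go sketch of the pattern suggested above: the goroutine returns its error and `g.Wait()` surfaces it, rather than the goroutine writing to a shared error variable. The `simulateTraffic` stub and `errLowQPS` sentinel are stand-ins for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// errLowQPS stands in for the "qps below threshold" condition; the real
// error type and name in the PR may differ.
var errLowQPS = errors.New("qps below threshold")

// simulateTraffic is a stand-in for traffic.SimulateTraffic.
func simulateTraffic(ctx context.Context) (report string, err error) {
	return "operation report", errLowQPS
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	var operationReport string
	g.Go(func() error {
		var err error
		operationReport, err = simulateTraffic(ctx)
		// Return the error through the errgroup instead of assigning it to a
		// shared variable, so there is no race if other goroutines ever need
		// to report an error too.
		return err
	})

	// A single synchronization point observes whichever goroutine failed first.
	if err := g.Wait(); err != nil {
		if errors.Is(err, errLowQPS) {
			fmt.Println("retriable failure:", err)
			return
		}
		fmt.Println("fatal failure:", err)
		return
	}
	fmt.Println("report:", operationReport)
}
```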
	t.Error(err)
	cancel()
	t.Fatal(err)
Just trying to understand this better: if we use `t.Fatal` here, we won't end up retrying the test, right? Referring to the comment here: #17725 (comment)
I consider retrying InjectFailpoint a mistake; retrying this error doesn't really help, because if the failpoint fails it will just fail twice. The situation only improved after we found and fixed underlying issues in the process handling code. The reason I introduced it was that the issue was so rare (one in every couple of hundred invocations) that I could not reproduce it locally, so I wanted to slap on a band-aid. I don't think that was a good decision.
No, it's not, as long as we investigate the reason, track flakiness changes over time, and don't hide them behind retries. Retrying obscures the test results without understanding the underlying issue. Low qps is caused not only by infrastructure noise but also by other underlying availability issues. Only by observing all the flakes and categorizing them can we discover issues like #17455. Of course, categorizing issues in the robustness tests brings additional toil. For now I'm happy to handle it, with plans to share the knowledge with the community via dedicated meetings, and potentially to automate the process at some point via string matching.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
For issue #17717
Errors like
`Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps.`
could be temporary. Proposing to make such failures retriable.
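For illustration only, here is a minimal sketch of what "retriable" could mean here, assuming a wrapper error type for the qps check and a retry loop around a single scenario attempt; the `RetriableError` and `runScenario` names are hypothetical, not necessarily what this PR implements.

```go
package main

import (
	"errors"
	"fmt"
)

// RetriableError marks failures, such as qps below the required minimum,
// that are likely caused by noisy infrastructure rather than an etcd bug.
type RetriableError struct{ Err error }

func (e *RetriableError) Error() string { return e.Err.Error() }
func (e *RetriableError) Unwrap() error { return e.Err }

// runScenario stands in for one robustness test iteration; here it flakes
// on the first attempt and passes afterwards.
func runScenario(attempt int) error {
	if attempt < 2 {
		return &RetriableError{Err: fmt.Errorf(
			"Requiring minimal %f qps for test results to be reliable, got %f qps", 200.0, 72.143568)}
	}
	return nil
}

func main() {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := runScenario(attempt)
		if err == nil {
			fmt.Println("scenario passed on attempt", attempt)
			return
		}
		var retriable *RetriableError
		if errors.As(err, &retriable) && attempt < maxAttempts {
			fmt.Println("retriable failure, retrying:", err)
			continue
		}
		// Non-retriable errors (or exhausted attempts) fail the test immediately.
		fmt.Println("failing test:", err)
		return
	}
}
```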