-
Notifications
You must be signed in to change notification settings - Fork 18
Step retries initial iteration #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@tfg13 @insomniacslk |
50b4a22
to
4edf494
Compare
Signed-off-by: Ilya <[email protected]>
Filter target events based on Retry number Unit tests for steps retry Signed-off-by: Ilya <[email protected]>
We changed behaviour of the runMonitor to keep targets input channel open any of the targets is using it. Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
Signed-off-by: Ilya <[email protected]>
b28bfbc
to
606aa41
Compare
pkg/event/testevent/test.go
Outdated
RunID types.RunID | ||
TestName string | ||
TestStepLabel string | ||
TestStepRetry int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to filter target statuses by last retry (otherwise we should put this logic to all reporters).
There are two ways:
- explicitly add retry (as currently done here)
- sort events by EmitTime and use windows based on "known" start/finish/error events.
I decided to go with the first one, as it is more explicit. But it requires more changes
But yeah you are right, writing plugins is too complicated. |
Thanks, I will make the proposal in a different PR |
Please start reviewing with PR 37, as it has the same concept, but contains less changes |
I'm also interested to find out why this "system pipeline" mechanism was written for test steps. As in, a single goro is created and receives targets in a channel. Might be some sequencing thing that escapes me at the moment. |
es, | ||
) | ||
} | ||
require.NoError(ts.T(), ts.stopServer(5*time.Second)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wouldve said that this 5s needed to be less than the 20s lower down, but if i look in the stopServer impl, the timeout isnt used at all
} | ||
|
||
// EmitTestEvent emits an event | ||
func EmitTestEvent(ctx xcontext.Context, event testevent.Event) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? apart from not checking allowed events, this is virtually identical to Emit()
tgs.tgt, tgs.CurStep, tgs.CurPhase, tgs.CurRetry, finished, resText) | ||
} | ||
|
||
// eventDistributor keeps track of retries in target state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this name is kinda confusing for me. The main method is Emit but this object is also "keeping track of retries in target state"? Im not sure i understand the business logic here
|
||
type TestStepRetryParameters struct { | ||
NumRetries int | ||
RetryInterval *xjson.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why ptr here? could just be defaulted to 0 and risk the null deref
} | ||
|
||
func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) error { | ||
func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) (error, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; in order to differentiate, i used a type rename in other diffs;
type outcome error
if !found { | ||
// this should never happen | ||
ctx.Errorf("Unknown target ID: '%s'", data.Target.ID) | ||
return ed.emitForUnknownTarget(ctx, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this should never happen, why does it still have en emit event handler (with invalid header because of the -1 retry) and not an error?
// This is fine, just need to unblock target handlers waiting on result from this step. | ||
for _, tgs := range tr.targets { | ||
if tgs.resCh != nil && tgs.CurStep == i { | ||
if !tgs.resChClosed && tgs.CurStep == i { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this extra bool needed, apart from the channel being null?
if res != nil { | ||
tgs.Res = xjson.NewError(res) | ||
tgs.CurRetry++ | ||
if tgs.CurRetry > ss.sb.RetryParameters.NumRetries { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; this seems more like a MaxRetries, rather than NumRetries
a number of my comments on #37 also apply here; stuff related to retries |
Will have to rework the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just returning back to author queue
Waiting for: #48 |
I will rework everything and start a new PR. |
Mainly Deomid's changes in test runner + some bug fix and unit tests.
Added retry to emitted events and keep only target events with the latest retry (otherwise all reporters should be aware of retries).
Fixed old bugs:
TODO next:
Temporary disabled one unit test that helps to find incorrectly implemented test steps that do not reply to specific output channel as fixing it requires extra rework of Test Runner.(*)
Also found race condition in runMonitor: we can possibly hang if test will switch to the next step before runMonitor detects. (will fix in the next PR)
Support Retry field in the database (as it requires to change database (add new column)). Will make a separate review.
This review should be safe to land, though (*) is arguable. Now, when we retry steps, we either can't close input targets channel before all targets progress to the next step or we should relaunch test step. I personally don't like the idea of relaunching test step. Can discuss in the comments