Skip to content

Conversation

rihter007
Copy link
Contributor

@rihter007 rihter007 commented Nov 5, 2021

Mainly Deomid's changes in test runner + some bug fix and unit tests.
Added retry to emitted events and keep only target events with the latest retry (otherwise all reporters should be aware of retries).
Fixed old bugs:

  • race-condition during targets channel close in Test Runner;
  • TestSteps runner helper should stop accepting new targets on pause;

TODO next:
Temporary disabled one unit test that helps to find incorrectly implemented test steps that do not reply to specific output channel as fixing it requires extra rework of Test Runner.(*)
Also found race condition in runMonitor: we can possibly hang if test will switch to the next step before runMonitor detects. (will fix in the next PR)
Support Retry field in the database (as it requires to change database (add new column)). Will make a separate review.

This review should be safe to land, though (*) is arguable. Now, when we retry steps, we either can't close input targets channel before all targets progress to the next step or we should relaunch test step. I personally don't like the idea of relaunching test step. Can discuss in the comments

@rihter007 rihter007 requested a review from tfg13 November 5, 2021 19:59
@rihter007
Copy link
Contributor Author

@tfg13 @insomniacslk
I was also thinking of changing step plugin interface. Currently it consumes input/output target channels, internally launches goroutine for each target and sends result to an output channel.
Maybe we should change it to smth like: "func Run(ctx, inputTarget, emitter) (targetResult, error)" and launch goroutines for each target in test runner? IMHO it will put channels/goroutines complexity into a single place (yep, currently we have a helper for clients, but they still have flexibility to make mistakes)

rojer9-fb and others added 14 commits November 6, 2021 06:38
Signed-off-by: Ilya <[email protected]>
Filter target events based on Retry number
Unit tests for steps retry

Signed-off-by: Ilya <[email protected]>
We changed behaviour of the runMonitor to keep targets input channel open any of the targets is using it.

Signed-off-by: Ilya <[email protected]>
@rihter007 rihter007 force-pushed the feature_step_retries branch from b28bfbc to 606aa41 Compare November 6, 2021 03:38
RunID types.RunID
TestName string
TestStepLabel string
TestStepRetry int
Copy link
Contributor Author

@rihter007 rihter007 Nov 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to filter target statuses by last retry (otherwise we should put this logic to all reporters).
There are two ways:

  • explicitly add retry (as currently done here)
  • sort events by EmitTime and use windows based on "known" start/finish/error events.

I decided to go with the first one, as it is more explicit. But it requires more changes

@tfg13
Copy link
Collaborator

tfg13 commented Nov 8, 2021

@tfg13 @insomniacslk I was also thinking of changing step plugin interface. Currently it consumes input/output target channels, internally launches goroutine for each target and sends result to an output channel.
Not all plugins do this. We have some that gather all the targets (until input channel closed) and then run one bulk job with them. Would that still work then?

But yeah you are right, writing plugins is too complicated.

@rihter007
Copy link
Contributor Author

@tfg13 @insomniacslk I was also thinking of changing step plugin interface. Currently it consumes input/output target channels, internally launches goroutine for each target and sends result to an output channel.

Not all plugins do this. We have some that gather all the targets (until input channel closed) and then run one bulk job with them. Would that still work then?

But yeah you are right, writing plugins is too complicated.

Thanks, I will make the proposal in a different PR

@rihter007
Copy link
Contributor Author

Please start reviewing with PR 37, as it has the same concept, but contains less changes

@mimir-d
Copy link
Member

mimir-d commented Nov 15, 2021

@tfg13 @insomniacslk I was also thinking of changing step plugin interface. Currently it consumes input/output target channels, internally launches goroutine for each target and sends result to an output channel.

Not all plugins do this. We have some that gather all the targets (until input channel closed) and then run one bulk job with them. Would that still work then?
But yeah you are right, writing plugins is too complicated.

Thanks, I will make the proposal in a different PR

I'm also interested to find out why this "system pipeline" mechanism was written for test steps. As in, a single goro is created and receives targets in a channel. Might be some sequencing thing that escapes me at the moment.

es,
)
}
require.NoError(ts.T(), ts.stopServer(5*time.Second))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wouldve said that this 5s needed to be less than the 20s lower down, but if i look in the stopServer impl, the timeout isnt used at all

}

// EmitTestEvent emits an event
func EmitTestEvent(ctx xcontext.Context, event testevent.Event) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? apart from not checking allowed events, this is virtually identical to Emit()

tgs.tgt, tgs.CurStep, tgs.CurPhase, tgs.CurRetry, finished, resText)
}

// eventDistributor keeps track of retries in target state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this name is kinda confusing for me. The main method is Emit but this object is also "keeping track of retries in target state"? Im not sure i understand the business logic here


type TestStepRetryParameters struct {
NumRetries int
RetryInterval *xjson.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ptr here? could just be defaulted to 0 and risk the null deref

}

func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) error {
func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) (error, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; in order to differentiate, i used a type rename in other diffs;

type outcome error

if !found {
// this should never happen
ctx.Errorf("Unknown target ID: '%s'", data.Target.ID)
return ed.emitForUnknownTarget(ctx, data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this should never happen, why does it still have en emit event handler (with invalid header because of the -1 retry) and not an error?

// This is fine, just need to unblock target handlers waiting on result from this step.
for _, tgs := range tr.targets {
if tgs.resCh != nil && tgs.CurStep == i {
if !tgs.resChClosed && tgs.CurStep == i {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this extra bool needed, apart from the channel being null?

if res != nil {
tgs.Res = xjson.NewError(res)
tgs.CurRetry++
if tgs.CurRetry > ss.sb.RetryParameters.NumRetries {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; this seems more like a MaxRetries, rather than NumRetries

@mimir-d
Copy link
Member

mimir-d commented Nov 16, 2021

a number of my comments on #37 also apply here; stuff related to retries

@rihter007
Copy link
Contributor Author

Will have to rework the PR.
It also requires some prior work of moving step running logic into a separate entity

Copy link
Member

@mimir-d mimir-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just returning back to author queue

@rihter007
Copy link
Contributor Author

Waiting for: #48

@rihter007
Copy link
Contributor Author

I will rework everything and start a new PR.

@rihter007 rihter007 closed this Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants