Step retries initial iteration #33

rihter007 · 2021-11-05T19:59:25Z

Mainly Deomid's changes in the test runner, plus some bug fixes and unit tests.
Added the retry number to emitted events and kept only target events from the latest retry (otherwise every reporter would have to be aware of retries).
Fixed old bugs:

a race condition when closing the targets channel in the Test Runner;
the TestSteps runner helper should stop accepting new targets on pause.

TODO next:
Temporarily disabled one unit test that helps find incorrectly implemented test steps that do not reply on the expected output channel, as fixing it requires extra rework of the Test Runner. (*)
Also found a race condition in runMonitor: we can hang if the test switches to the next step before runMonitor detects it (will fix in the next PR).
Support the Retry field in the database. This requires a database change (adding a new column), so I will make a separate review.

This review should be safe to land, though (*) is arguable. Now that we retry steps, we either can't close the input targets channel before all targets progress to the next step, or we have to relaunch the test step. I personally don't like the idea of relaunching the test step. We can discuss in the comments.

rihter007 · 2021-11-05T20:41:34Z

@tfg13 @insomniacslk
I was also thinking of changing the step plugin interface. Currently it consumes input/output target channels, internally launches a goroutine for each target, and sends the result to an output channel.
Maybe we should change it to something like "func Run(ctx, inputTarget, emitter) (targetResult, error)" and launch the goroutines for each target in the test runner? IMHO it would put the channel/goroutine complexity in a single place (yes, we currently have a helper for clients, but they still have the flexibility to make mistakes).

Signed-off-by: Ilya <[email protected]>

Filter target events based on Retry number Unit tests for steps retry Signed-off-by: Ilya <[email protected]>

We changed behaviour of the runMonitor to keep the targets input channel open while any of the targets is using it. Signed-off-by: Ilya <[email protected]>

Signed-off-by: Ilya <[email protected]>

rihter007 · 2021-11-06T19:45:09Z

pkg/event/testevent/test.go

 	RunID         types.RunID
 	TestName      string
 	TestStepLabel string
+	TestStepRetry int


We need to filter target statuses by the last retry (otherwise we would have to put this logic into every reporter).
There are two ways:

explicitly add retry (as currently done here)

sort events by EmitTime and use windows based on "known" start/finish/error events.

I decided to go with the first one, as it is more explicit, but it requires more changes.
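To illustrate the first option, here is a minimal sketch of filtering events by an explicit retry number. The Event struct and latestRetryOnly helper are hypothetical and trimmed down for illustration; the event names in main are only sample data.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down stand-in for a target event
// that carries an explicit retry number.
type Event struct {
	TargetID  string
	StepLabel string
	Name      string
	Retry     int
}

// key identifies one target within one step.
type key struct {
	targetID  string
	stepLabel string
}

// latestRetryOnly keeps, for each (target, step) pair, only the events
// that belong to the highest retry number seen. Input order is preserved.
func latestRetryOnly(events []Event) []Event {
	latest := make(map[key]int)
	for _, ev := range events {
		k := key{ev.TargetID, ev.StepLabel}
		if r, ok := latest[k]; !ok || ev.Retry > r {
			latest[k] = ev.Retry
		}
	}
	out := make([]Event, 0, len(events))
	for _, ev := range events {
		if ev.Retry == latest[key{ev.TargetID, ev.StepLabel}] {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	events := []Event{
		{"T1", "step1", "TargetIn", 0},
		{"T1", "step1", "TargetErr", 0},
		{"T1", "step1", "TargetIn", 1},
		{"T1", "step1", "TargetOut", 1},
		{"T2", "step1", "TargetIn", 0},
		{"T2", "step1", "TargetOut", 0},
	}
	for _, ev := range latestRetryOnly(events) {
		fmt.Printf("%s %s %s retry=%d\n", ev.TargetID, ev.StepLabel, ev.Name, ev.Retry)
	}
}
```

Because the retry number travels with each event, the filter does not need timestamps or knowledge of which events open and close an attempt, which is what the second option would rely on.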

tfg13 · 2021-11-08T17:19:18Z

> @tfg13 @insomniacslk I was also thinking of changing the step plugin interface. Currently it consumes input/output target channels, internally launches a goroutine for each target, and sends the result to an output channel.

Not all plugins do this. We have some that gather all the targets (until the input channel is closed) and then run one bulk job with them. Would that still work then?

But yeah you are right, writing plugins is too complicated.
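For illustration, a bulk-style plugin of the kind described above might look roughly like this. The Target type and the gatherAll and runBulk helpers are hypothetical; the sketch only shows the pattern of draining the input channel before doing any work.

```go
package main

import "fmt"

// Target is a hypothetical stand-in for a test target.
type Target struct{ ID string }

// gatherAll drains the input channel and returns the whole batch.
// It only returns once the channel has been closed.
func gatherAll(in <-chan *Target) []*Target {
	var batch []*Target
	for t := range in {
		batch = append(batch, t)
	}
	return batch
}

// runBulk is a placeholder for one job that covers the whole batch
// and reports a pass/fail result per target.
func runBulk(batch []*Target) map[string]bool {
	results := make(map[string]bool, len(batch))
	for _, t := range batch {
		results[t.ID] = true
	}
	return results
}

func main() {
	in := make(chan *Target, 3)
	for _, id := range []string{"t1", "t2", "t3"} {
		in <- &Target{ID: id}
	}
	close(in) // without this, gatherAll would block forever

	batch := gatherAll(in)
	fmt.Println(len(batch), len(runBulk(batch)))
}
```

A plugin written this way depends on the input channel being closed, so a per-target Run function would not fit it directly, and the question of when the runner closes the channel under retries matters for it.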

rihter007 · 2021-11-11T11:59:03Z

> > @tfg13 @insomniacslk I was also thinking of changing the step plugin interface. Currently it consumes input/output target channels, internally launches a goroutine for each target, and sends the result to an output channel.
>
> Not all plugins do this. We have some that gather all the targets (until the input channel is closed) and then run one bulk job with them. Would that still work then?
>
> But yeah you are right, writing plugins is too complicated.

Thanks, I will make the proposal in a different PR

rihter007 · 2021-11-14T14:33:16Z

Please start reviewing with PR 37, as it has the same concept but contains fewer changes.

mimir-d · 2021-11-15T23:37:43Z

> > > @tfg13 @insomniacslk I was also thinking of changing the step plugin interface. Currently it consumes input/output target channels, internally launches a goroutine for each target, and sends the result to an output channel.
> >
> > Not all plugins do this. We have some that gather all the targets (until the input channel is closed) and then run one bulk job with them. Would that still work then?
> > But yeah you are right, writing plugins is too complicated.
>
> Thanks, I will make the proposal in a different PR

I'm also interested to find out why this "system pipeline" mechanism was written for test steps. As in, a single goroutine is created and receives targets on a channel. Might be some sequencing thing that escapes me at the moment.

mimir-d · 2021-11-16T00:32:15Z

tests/e2e/e2e_test.go

 			es,
 		)
 	}
-	require.NoError(ts.T(), ts.stopServer(5*time.Second))


I would've said that this 5s needed to be less than the 20s lower down, but if I look at the stopServer impl, the timeout isn't used at all.

mimir-d · 2021-11-16T00:38:16Z

pkg/storage/events.go

 }
+
+// EmitTestEvent emits an event
+func EmitTestEvent(ctx xcontext.Context, event testevent.Event) error {


why is this needed? apart from not checking allowed events, this is virtually identical to Emit()

mimir-d · 2021-11-16T00:40:53Z

pkg/runner/test_runner.go

+		tgs.tgt, tgs.CurStep, tgs.CurPhase, tgs.CurRetry, finished, resText)
+}
+
+// eventDistributor keeps track of retries in target state


This name is kinda confusing for me. The main method is Emit, but this object is also "keeping track of retries in target state"? I'm not sure I understand the business logic here.

mimir-d · 2021-11-16T00:43:05Z

pkg/test/step.go


+type TestStepRetryParameters struct {
+	NumRetries    int
+	RetryInterval *xjson.Duration


why a ptr here? It could just default to 0 rather than risk a nil deref

mimir-d · 2021-11-16T00:50:43Z

pkg/runner/test_runner.go

 }

-func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) error {
+func (tr *TestRunner) awaitTargetResult(ctx xcontext.Context, tgs *targetState, ss *stepState) (error, error) {


nit: in order to differentiate the two errors, I used a type rename in other diffs:

type outcome error
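A minimal sketch of how such a named type can make a two-error return easier to read. The function body and its boolean parameters are hypothetical and only serve to produce the three possible cases.

```go
package main

import (
	"errors"
	"fmt"
)

// outcome names the target's own result, so that it is not confused with
// an error in the runner itself. It has the same method set as error.
type outcome error

// awaitTargetResult is a hypothetical stand-in: the first return value says
// how the target fared, the second whether the runner hit a problem.
func awaitTargetResult(targetFails, runnerBroken bool) (outcome, error) {
	if runnerBroken {
		return nil, errors.New("runner: result channel closed unexpectedly")
	}
	if targetFails {
		return outcome(errors.New("target failed the step")), nil
	}
	return nil, nil
}

func main() {
	res, err := awaitTargetResult(true, false)
	fmt.Println("outcome:", res, "error:", err)

	res, err = awaitTargetResult(false, true)
	fmt.Println("outcome:", res, "error:", err)
}
```

The signature now documents which value is which, although the compiler will still accept a plain error where an outcome is expected, since both are interfaces with the same method.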

mimir-d · 2021-11-16T00:52:46Z

pkg/runner/test_runner.go

+	if !found {
+		// this should never happen
+		ctx.Errorf("Unknown target ID: '%s'", data.Target.ID)
+		return ed.emitForUnknownTarget(ctx, data)


if this should never happen, why does it still have an emit event handler (with an invalid header because of the -1 retry) and not an error?

mimir-d · 2021-11-16T00:58:18Z

pkg/runner/test_runner.go

 			// This is fine, just need to unblock target handlers waiting on result from this step.
 			for _, tgs := range tr.targets {
-				if tgs.resCh != nil && tgs.CurStep == i {
+				if !tgs.resChClosed && tgs.CurStep == i {


why is this extra bool needed, apart from the channel being null?

mimir-d · 2021-11-16T01:01:42Z

pkg/runner/test_runner.go

+		if res != nil {
+			tgs.Res = xjson.NewError(res)
+			tgs.CurRetry++
+			if tgs.CurRetry > ss.sb.RetryParameters.NumRetries {


nit; this seems more like a MaxRetries, rather than NumRetries

mimir-d · 2021-11-16T01:04:11Z

a number of my comments on #37 also apply here; stuff related to retries

rihter007 · 2021-11-21T18:46:26Z

Will have to rework the PR.
It also requires some prior work of moving step running logic into a separate entity

mimir-d

just returning back to author queue

rihter007 · 2021-11-26T22:11:42Z

Waiting for: #48

rihter007 · 2021-12-17T13:11:08Z

I will rework everything and start a new PR.

rihter007 requested a review from tfg13 November 5, 2021 19:59

rihter007 force-pushed the feature_step_retries branch from 50b4a22 to 4edf494 Compare November 5, 2021 22:36

rihter007 requested a review from insomniacslk November 5, 2021 22:36

rojer9-fb and others added 14 commits November 6, 2021 06:38

WIP

142921d

Signed-off-by: Ilya <[email protected]>

Add retry number to emitted events

6193ff5

Filter target events based on Retry number Unit tests for steps retry Signed-off-by: Ilya <[email protected]>

Temporary disable TestStepLosesTargets unit test.

b6d4a3c

We changed behaviour of the runMonitor to keep the targets input channel open while any of the targets is using it. Signed-off-by: Ilya <[email protected]>

Use time.Until instead of '.Sub(time.Now())'

507cd0b

Signed-off-by: Ilya <[email protected]>

Remove unused structure member

aa273d6

Signed-off-by: Ilya <[email protected]>

Fix integration test

5a09996

Signed-off-by: Ilya <[email protected]>

Protect teststep internals with Mutex

9d6cba3

Signed-off-by: Ilya <[email protected]>

Fix linter warning

079a521

Signed-off-by: Ilya <[email protected]>

Fix race condition

fc0351a

Signed-off-by: Ilya <[email protected]>

Fix integration test: add retry number in expected result

37527df

Signed-off-by: Ilya <[email protected]>

Fix e2e test: add retry number in expected result

3720b07

Signed-off-by: Ilya <[email protected]>

Fix race-condition in TestRunner when closing channel

bf68e31

Signed-off-by: Ilya <[email protected]>

Break waiting for input targets loop in teststeps on pause

ebc92be

Signed-off-by: Ilya <[email protected]>

Fix TestPauseResume e2e test

606aa41

Signed-off-by: Ilya <[email protected]>

rihter007 force-pushed the feature_step_retries branch from b28bfbc to 606aa41 Compare November 6, 2021 03:38

rihter007 commented Nov 6, 2021


Rename testevent.Header TestStepRetry into Retry

fda1eb8

mimir-d reviewed Nov 16, 2021


mimir-d requested changes Nov 23, 2021


rihter007 closed this Dec 17, 2021

4 participants