Backup requests for replicated storage #7

ericox · 2019-03-11T04:18:09Z

This is an implementation of client backup requests for replicated storage. The primary and backup requests in a group are now marked with a new application-level requestID that is constant across the group. Additionally, the set of "other" backup hosts that may be working on a given read is added to RPC requests, so that when it comes time to cancel work, a tractserver knows which host(s) to send cancellation RPCs to. The tractserver-to-tractserver cancellation work that uses this new request ID still needs to be done in a later PR.

NOTE: error reporting code was removed in readOneTractReplicated and readOneTractRS. Added TODOs for when we redo this logic.

client/blb/client.go

internal/core/genreqid.go

dnr · 2019-03-11T20:41:29Z

client/blb/client_test.go

+		}
+	}()
+
+	bch <- time.Time{}


part of the idea of using channels is so you can separate out the calls and step through them one by one. so you could send one Time value to unblock the first one, then sleep briefly and check what the first request did, then unblock the second, etc.

ideally we should test the requests being answered in different orders. so I think we probably need multiple channels so we can give a separate one to each request.

(the brief sleep is kind of bad but maybe not too bad. as you say, we could add more synchronization points but it gets kind of messy.)
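A minimal sketch of the one-channel-per-request idea, assuming hypothetical names (`runGated`, `gates`, `done`) that are not in the PR. Instead of the brief sleep mentioned above, this variation uses an explicit completion channel as the extra synchronization point, so the test can release the fake requests in any order it likes and observe each one finishing.

```go
package main

import (
	"fmt"
	"sync"
)

// runGated starts n fake requests. Request i blocks until a value is sent on
// gates[i] and then reports its index on done. releaseOrder must name every
// request exactly once, otherwise the unreleased goroutines block forever.
func runGated(n int, releaseOrder []int) []int {
	gates := make([]chan struct{}, n)
	for i := range gates {
		gates[i] = make(chan struct{})
	}
	done := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-gates[i] // wait for the test to release this request
			done <- i  // report completion
		}(i)
	}

	var completed []int
	for _, i := range releaseOrder {
		gates[i] <- struct{}{}                // unblock exactly one request
		completed = append(completed, <-done) // wait until it has finished
	}
	wg.Wait()
	return completed
}

func main() {
	// Answer the second request first, then the first, then the third.
	fmt.Println(runGated(3, []int{1, 0, 2}))
}
```

Because only one request is released at a time, the completion order always matches the release order, which is what makes different response orderings testable.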

client/blb/client.go

client/blb/backup_reads.go

client/blb/client.go

dnr

I think the read logic is definitely easier to understand now. There's probably some more cleanup that could be done eventually, but we can think more about tests now.

There's a bunch of cases that should be accounted for: first request returns success before backup request, first request returns error before backup request returns success, backup request returns success before first request, backup request returns error before first request returns success, backup request returns error before first request returns error and it goes on to a third sequential request, and probably more that I'm not thinking of at the moment.
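The cases listed above could be written down as a table of arrival orders. This is only a sketch under simplifying assumptions: `event`, `scenario`, and `firstSuccess` are hypothetical names, and the model is just "the first successful response wins", not the client's actual read path.

```go
package main

import "fmt"

// event is one response arriving at the client.
type event struct {
	req string // "primary", "backup", or "third"
	ok  bool   // true if the response was a success
}

// scenario describes the order in which responses arrive and which request
// is expected to supply the result.
type scenario struct {
	name     string
	arrivals []event
	want     string
}

// firstSuccess returns the request whose response would be used, or "" if
// every response was an error.
func firstSuccess(arrivals []event) string {
	for _, e := range arrivals {
		if e.ok {
			return e.req
		}
	}
	return ""
}

// scenarios enumerates the orderings named in the review comment.
func scenarios() []scenario {
	return []scenario{
		{"primary succeeds before backup", []event{{"primary", true}}, "primary"},
		{"primary fails, then backup succeeds", []event{{"primary", false}, {"backup", true}}, "backup"},
		{"backup succeeds before primary", []event{{"backup", true}}, "backup"},
		{"backup fails, then primary succeeds", []event{{"backup", false}, {"primary", true}}, "primary"},
		{"backup and primary fail, third succeeds", []event{{"backup", false}, {"primary", false}, {"third", true}}, "third"},
	}
}

func main() {
	for _, s := range scenarios() {
		got := firstSuccess(s.arrivals)
		fmt.Printf("%-42s got=%q match=%v\n", s.name, got, got == s.want)
	}
}
```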

dnr · 2019-03-31T01:22:22Z

client/blb/client.go

+		done := false
+		var res tractResultRepl
+		for n := 0; n < nReaders; n++ {
+			res = <-resultCh
 			if res.err == core.NoError {


you probably want this to be if res.err == core.NoError && !done to avoid taking two successes (very unlikely because of the cancel, but just in case)
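A minimal sketch of that guard, assuming a simplified `result` type and a plain `error` in place of `core.NoError` (both are stand-ins, not the PR's types). The loop still drains every reader's result, but only the first success is kept.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a simplified stand-in for tractResultRepl.
type result struct {
	host string
	err  error
}

// collect receives exactly n results and keeps only the first success.
// Later successes are drained but ignored.
func collect(resultCh <-chan result, n int) (result, bool) {
	var winner result
	done := false
	for i := 0; i < n; i++ {
		res := <-resultCh
		if res.err == nil && !done {
			winner = res
			done = true
		}
	}
	return winner, done
}

func main() {
	ch := make(chan result, 3)
	ch <- result{host: "ts1", err: errors.New("timeout")}
	ch <- result{host: "ts2"}
	ch <- result{host: "ts3"} // second success, must not overwrite ts2

	winner, ok := collect(ch, 3)
	fmt.Println(winner.host, ok) // ts2 true
}
```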

dnr · 2019-03-31T01:24:40Z

client/blb/client.go

 		if res.err != core.NoError && res.err != core.ErrEOF {
 			continue
 		}
-		*result = res
+		*result = res.tractResult


It'd be nice to consolidate the copy, assignment to *result, and zero-filling the extra with the part above. If it seems tricky, we can leave that until later.

dnr · 2019-03-31T01:28:23Z

client/blb/client.go

+		done := false
+		var res tractResultRepl
+		for n := 0; n < nReaders; n++ {
+			res = <-resultCh


You can just do res := <-resultCh, right? No need to pre-declare it.

dnr · 2019-03-31T01:29:30Z

client/blb/client.go

 				*result = res.tractResult
-				copy(thisB, res.thisB)
-				return
+				copy(thisB[:len(res.thisB)], res.thisB)


just copy(thisB, res.thisB). copy will only copy as much as there's room for.
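For reference, the built-in `copy` copies `min(len(dst), len(src))` elements and returns that count, so slicing the destination first is unnecessary. The sketch below shows this together with zero-filling the remainder; `copyAndZero` is a hypothetical helper name, not something from the PR.

```go
package main

import "fmt"

// copyAndZero copies src into dst and zeroes whatever part of dst was not
// overwritten. It returns the number of bytes copied.
func copyAndZero(dst, src []byte) int {
	n := copy(dst, src) // copies min(len(dst), len(src)) bytes
	for i := n; i < len(dst); i++ {
		dst[i] = 0
	}
	return n
}

func main() {
	dst := []byte("????")

	// Source shorter than destination: two bytes copied, the rest zeroed.
	n := copyAndZero(dst, []byte("ab"))
	fmt.Println(n, dst) // 2 [97 98 0 0]

	// Source longer than destination: copy stops when dst is full.
	n = copyAndZero(dst, []byte("abcdefgh"))
	fmt.Println(n, string(dst)) // 4 abcd
}
```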

dnr

In general I think these hooks for testing will work pretty well. It's possible you'll have to tweak the placement. You'll probably want to make a generic backupRequestFunc for testing. More comments on the code...

dnr · 2019-04-11T20:27:31Z

client/blb/client.go

+
+// vars for test injection
+var (
+	randOrder         = getRandomOrder


can just write randOrder = rand.Perm
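A small sketch of why the wrapper is unnecessary: `rand.Perm` already has the signature `func(int) []int`, so it can be assigned to the variable directly, and a test can still swap in a deterministic order. `identityOrder` is a hypothetical test helper.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randOrder is a function variable so tests can replace it.
// rand.Perm(n) returns a random permutation of [0, n).
var randOrder = rand.Perm

// identityOrder is a deterministic replacement for use in tests.
func identityOrder(n int) []int {
	order := make([]int, n)
	for i := range order {
		order[i] = i
	}
	return order
}

func main() {
	fmt.Println(len(randOrder(3))) // 3

	randOrder = identityOrder // what a test would do
	fmt.Println(randOrder(3)) // [0 1 2]
}
```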

dnr · 2019-04-11T20:31:55Z

client/blb/client.go

+	host string,
+	reqID string,
+	tract *core.TractInfo,
+	thisB []byte,


It's weird to pass in a slice without using it at all. I know we may have to pass in the slice later if we want to try removing the allocation/copy, but the code should be as clean as possible at this point in time, so we should pass just the length now, and change it back later if we need to.

dnr · 2019-04-11T20:33:32Z

client/blb/client.go

+	thisOffset int64,
+	order []int,
+	n int,
+	resultCh chan tractResultRepl) {


put a direction restriction on the chan type if you can

Tried to put a direction restriction where the compiler would allow.
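As an illustration of direction restrictions in general (not the PR's code), a function that only sends can take `chan<- T` and one that only receives can take `<-chan T`. A bidirectional channel converts to either implicitly at the call site, and the compiler rejects any use in the wrong direction. The names `producer` and `sum` are made up for this sketch.

```go
package main

import "fmt"

// producer may only send on out; a receive from out would not compile.
// Closing a send-only channel is allowed.
func producer(out chan<- int, n int) {
	for i := 0; i < n; i++ {
		out <- i
	}
	close(out)
}

// sum may only receive from in; a send on in would not compile.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int) // bidirectional, narrowed at each call
	go producer(ch, 4)
	fmt.Println(sum(ch)) // 6
}
```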

dnr · 2019-04-11T21:07:19Z

client/blb/client.go

+	randOrder         = getRandomOrder
+	backupRequestFunc = doParallelBackupReads
+	readDone          = func() {} // call when read/readAt rpcs return.
+	backupPhaseDone   = func() {} // call when the entire backup read phase is done.


These are only used in tests, but it still feels better to put them in the Client rather than globals. Also maybe put "Hook" in the name so the semantics are more clear (that they're just hooks for injecting test synchronization).
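One way this could look, as a sketch only: the hook lives on the client as a field that defaults to a no-op, with "Hook" in its name. The `client` type, `newClient`, and `readDoneHook` here are hypothetical stand-ins for the real `Client`.

```go
package main

import "fmt"

// client is a simplified stand-in for the real Client.
type client struct {
	// readDoneHook is called when a read returns. It is a no-op in
	// production and exists only so tests can inject synchronization.
	readDoneHook func()
}

// newClient returns a client with all hooks set to no-ops.
func newClient() *client {
	return &client{readDoneHook: func() {}}
}

// read pretends to perform a read and fires the hook when it returns.
func (c *client) read() string {
	defer c.readDoneHook()
	return "data"
}

func main() {
	c := newClient()
	calls := 0
	c.readDoneHook = func() { calls++ } // override on this instance only
	c.read()
	fmt.Println(calls) // 1

	// A fresh client is unaffected by the override above.
	newClient().read()
	fmt.Println(calls) // 1
}
```

Because the override is per instance, a test that wants default behavior can simply construct a new client rather than restoring shared state.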

dnr · 2019-04-11T21:12:30Z

client/blb/client_test.go

-	return nil
+	randOrder = func(n int) []int {
+		order := make([]int, n)
+		for i := 0; i < n; i++ {


can write for i := range order

dnr · 2019-04-11T21:26:56Z

client/blb/client_test.go

+	// Does the first request succeed before backup is sent work?
+	sendPrimary := make(chan bool)
+	sendBackup := make(chan bool)
+	backupRequestFunc = func(


It might be annoying to have to write a new implementation of this for each test.. I wonder if there's a generic version that can be reused? You can try just writing a few tests and see what commonalities fall out.

dnr · 2019-04-11T21:29:24Z

client/blb/client.go

+	thisB []byte,
+	thisOffset int64,
+	order []int,
+	n int,


could pass in order[n] instead of both of them?

dnr · 2019-04-11T21:30:29Z

client/blb/client.go

+		resultCh <- tractResultRepl{
+			tractResult: tractResult{0, 0, core.ErrCanceled, badVersionHost},
+		}
+	case <-cli.backupReadState.delayFunc(time.Duration(n) * delay):


could pass in the delay value instead of n? or possibly even the timer channel itself? then you might not need delayFunc at all, since you're going to override the spawning logic.

Are you thinking that we override spawning logic at the layer below by putting readOneTractWithResult calls in different goroutines? Or if we get rid of the bch channel in the test overrides, we could pass in a time.After(0) to each and control execution by outer channels as I wrote in the tests.

I was thinking the latter.. pass in 0 or time.After(0) for all the calls in your backupRequestFunc implementation and control things by the order that you spawn them.

Passing the delay channel seemed to work better for testing. Went with that over time.After(0).
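A minimal sketch of the "pass the delay channel" shape, under assumptions: `delayedRead`, `releasedGate`, and the plain `cancel` channel are hypothetical, and the real code returns `core.ErrCanceled` through a result channel rather than an `error`. Production code would pass `time.After(d)`, while a test passes a channel it controls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errCanceled = errors.New("canceled")

// delayedRead waits for delay to fire before reading, unless cancel is
// closed first. The caller decides what drives delay.
func delayedRead(cancel <-chan struct{}, delay <-chan time.Time, read func() string) (string, error) {
	select {
	case <-cancel:
		return "", errCanceled
	case <-delay:
		return read(), nil
	}
}

// releasedGate returns a delay channel that fires immediately, which is
// handy in tests that control ordering by when they spawn each reader.
func releasedGate() <-chan time.Time {
	ch := make(chan time.Time)
	close(ch)
	return ch
}

func main() {
	never := make(chan struct{}) // never closed: no cancellation

	// Production-style call: a real timer drives the delay.
	out, err := delayedRead(never, time.After(10*time.Millisecond), func() string { return "backup read" })
	fmt.Println(out, err)

	// Test-style call: the channel is already released.
	out, err = delayedRead(never, releasedGate(), func() string { return "released by test" })
	fmt.Println(out, err)

	// Canceled before the delay fires (a nil channel never fires).
	canceled := make(chan struct{})
	close(canceled)
	_, err = delayedRead(canceled, nil, func() string { return "unreachable" })
	fmt.Println(err)
}
```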

dnr · 2019-04-11T21:31:11Z

client/blb/client.go

+	var badVersionHost string
+	host := tract.Hosts[order[n]]
+	if host == "" {
+		log.V(1).Infof("read %s from tsid %d: no host", tract.Tract, tract.TSIDs[n])


tract.Hosts and tract.TSIDs are parallel, right? so using n here is wrong, it should be order[n]?
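To illustrate the indexing issue with a simplified stand-in for `core.TractInfo` (the `tractInfo` type and `pick` helper below are hypothetical): when two slices are parallel, both must be indexed through the same permuted index, otherwise the log line names a different tractserver than the one being contacted.

```go
package main

import "fmt"

// tractInfo is a simplified stand-in; Hosts[i] and TSIDs[i] describe the
// same tractserver.
type tractInfo struct {
	Hosts []string
	TSIDs []int
}

// pick returns the host and TSID for the n-th attempt under order.
// Both slices are indexed with order[n], never with n directly.
func pick(info tractInfo, order []int, n int) (string, int) {
	idx := order[n]
	return info.Hosts[idx], info.TSIDs[idx]
}

func main() {
	info := tractInfo{
		Hosts: []string{"a", "b", "c"},
		TSIDs: []int{10, 20, 30},
	}
	order := []int{2, 0, 1}

	host, tsid := pick(info, order, 0)
	fmt.Println(host, tsid) // c 30

	// Indexing TSIDs with n instead of order[n] pairs host "c" with the
	// wrong ID.
	fmt.Println(host, info.TSIDs[0]) // c 10
}
```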

dnr · 2019-04-11T21:33:11Z

client/blb/client_test.go

-	// read completes, the other two are cancelled. Perhaps we need to mock the cancel func
-	// as well to disable cancellation?
+	// release fake sleep
+	sendPrimary <- true
 	bch <- time.Time{}


yeah, I'm thinking you don't need delayFunc/bch after all

ericox · 2019-04-17T02:52:27Z

Addressed last batch of comments, and cleaned up the tests.

dnr · 2019-04-19T16:43:32Z

client/blb/client_test.go

-	cancelDone = func() {}
+// restoreTestFuncs restores the overrides created in setupBackupClient so other
+// tests not relevant to backups work as is.
+func (cli *Client) restoreTestFuncs() {


I don't think we should have a function like this. Tests should just make new Clients.

dnr · 2019-04-19T16:53:26Z

client/blb/client_test.go

 	fail := func(e tsTraceEntry) core.Error {
-		// Reads from the second tractserver fails.
+		// Reads from the first tractserver fails.


dnr

Sorry for the delay. I made a bunch of notes on small things that can be cleaned up, but nothing major.

After that, I think what I'll do is merge it into a branch in this repo (since it is technically broken without the reportbadts stuff) and work on restoring that, then we can merge it to master. You can work on cross-ts cancellation on that branch.

dnr · 2019-05-02T02:17:26Z

client/blb/client.go

+	resultCh chan<- tractResultRepl,
+	nOrder int) {
+	err := core.ErrAllocHost // default error if none present
+	var badVersionHost string


unused here?

dnr · 2019-05-02T02:18:52Z

client/blb/client.go

+	delayTimer <-chan time.Time,
+	resultCh chan<- tractResultRepl,
+	nOrder int) {
+	err := core.ErrAllocHost // default error if none present


this is only used on one place below, you can just inline it

dnr · 2019-05-02T02:30:01Z

client/blb/client.go

@@ -1111,46 +1241,70 @@ func (cli *Client) readOneTractReplicated(
 	thisB []byte,
 	thisOffset int64) {

+	// TODO(eric): remove this var when we redo bad ts version reporting
 	var badVersionHost string


dnr · 2019-05-02T23:03:46Z

client/blb/client_test.go

 // newClient creates a Client suitable for testing. The trace function given
 // should return core.NoError for a read/write to proceed, and something else to
-// inject an error.
+// inject an error. The disableBackupReads parameter allows the backup read feature


comment change should be reverted

dnr · 2019-05-02T23:10:57Z

client/blb/client_test.go

+	checkWrite(t, blob, p1)
+}
+
+// testRead tests a single read at a given length and offset.


this doesn't use len (should be length) or off, or done.

dnr · 2019-05-02T23:23:40Z

client/blb/client_test.go

+// setupBackupClient initializes a client that has backup requests enabled.
+// it also overrides synchronization hooks on the client that allow for
+// contol of read behavior operation ordering.
+func (cli *Client) setupBackupClient(maxNumBackups int, overrideDelay bool) (<-chan bool, <-chan bool, <-chan bool) {


overrideDelay is never false, so it probably shouldn't exist. It seems like this isn't really useful without it, no?

dnr · 2019-05-02T23:31:44Z

client/blb/client_test.go

+// setupBackupRequestFunc overrides spawning logic for backup reads, it will spawn nReaders
+// goroutines that will block on respective delayChans. To release a reader one can write to the
+// delayChan.
+func setupBackupRequestFunc(cli *Client, nReaders int) []chan time.Time {


also, this is always used in conjunction with setupBackupClient, so they should probably be one function

dnr · 2019-05-02T23:44:25Z

client/blb/client_test.go

+	<-cancelCh
+	<-bdone
+	<-readDone
+}


should this (and all the following tests) check the trace log too? how do they tell that the right thing is happening?

dnr · 2019-05-02T23:45:45Z

client/blb/client_test.go

+	testWrite(t, blob, core.TractLength, core.TractLength)
+
+	// Test one request per host, first request finishes
+	rdone, bdone, _ := cli.setupBackupClient(2, true) // 2 backup requests


2 here means 1 backup request, right?

dnr · 2019-05-02T23:46:50Z

client/blb/client_test.go

+	<-rdone
+	<-readDone
+
+	// Do we fallback if the backup request returns an error before the first one


why are there two tests in this function? shouldn't they be isolated?

ericox · 2019-05-07T00:39:54Z

Sorry for the delay. I made a bunch of notes on small things that can be cleaned up, but nothing major.

After that, I think what I'll do is merge it into a branch in this repo (since it is technically broken without the reportbadts stuff) and work on restoring that, then we can merge it to master. You can work on cross-ts cancellation on that branch.

Thanks! Will get to this soon.

Eric Cox and others added 7 commits January 29, 2019 22:15

request id

04bf5c2

plumbing for reqID in client

a4a6d00

fix client compile bugs

0349476

add cutSlice helper and plumb otherHosts through rpc

f1413ea

disable backups

d46fb6e

configurable client backup reads

4bbeae9

add read check

c6096ed

ericox requested a review from dnr March 11, 2019 04:18

ericox changed the title ~~Backup requests replicated~~ Backup requests for replicated storage Mar 11, 2019

rm print

974bc72

dnr requested changes Mar 11, 2019

ericox added 4 commits March 15, 2019 23:13

cleanup backup read logic

288ae14

merge two read funcs

3243231

review feedback

86b5d4c

comments

a07e2f7

ericox commented Mar 17, 2019

client/blb/client.go (outdated)

move bad ts report logic

bf10f9d

ericox commented Mar 23, 2019

client/blb/client.go (outdated)

ericox added 3 commits March 23, 2019 11:26

cleanup orderCh

00de983

cleanup and remove error reporting for now

4d78d65

rm reporting

4c8ca12

dnr reviewed Mar 26, 2019

ericox added 2 commits March 28, 2019 22:25

backup error logic

3adde22

pull logic from fallback loop into helper

dfa1952

dnr reviewed Mar 29, 2019

ericox added 3 commits March 29, 2019 23:26

review feedback

2af38ac

review feedback

ab39a54

move select into helper

84540ed

dnr reviewed Mar 31, 2019

refactor again

571ed3c

ericox added 2 commits April 6, 2019 22:04

backup test scheme + refactor

7ae0645

adding tests

edfc8db

dnr reviewed Apr 11, 2019

ericox requested a review from dnr April 17, 2019 02:17

cleanup and tests

fce1cca

dnr reviewed Apr 19, 2019

dnr reviewed May 2, 2019
