-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backup requests for replicated storage #7
base: master
Are you sure you want to change the base?
Backup requests for replicated storage #7
Conversation
client/blb/client_test.go
Outdated
} | ||
}() | ||
|
||
bch <- time.Time{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
part the idea of using channels is so you can separate out the calls and step them one by one. so you could send one Time value to unblock the first one, then sleep briefly and check what the first request did, then unblock the second, etc.
ideally we should test the requests being answered in different orders. so I think we probably need multiple channels so we can give a separate one to each request.
(the brief sleep is kind of bad but maybe not too bad. as you say, we could add more synchronization points but it gets kind of messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the read logic is definitely easier to understand now. There's probably some more cleanup that could be done eventually, but we can think more about tests now.
There's a bunch of cases that should be accounted for: first request returns success before backup request, first request returns error before backup request returns success, backup request returns success before first request, backup request returns error before first request returns success, backup request returns error before first request returns error and it goes on to a third sequential request, and probably more than I'm not thinking of at the moment.
client/blb/client.go
Outdated
done := false | ||
var res tractResultRepl | ||
for n := 0; n < nReaders; n++ { | ||
res = <-resultCh | ||
if res.err == core.NoError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably want this to be if res.err == core.NoError && !done
to avoid taking two successes (very unlikely because of the cancel, but just in case)
client/blb/client.go
Outdated
if res.err != core.NoError && res.err != core.ErrEOF { | ||
continue | ||
} | ||
*result = res | ||
*result = res.tractResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to consolidate the copy, assignment to *result
, and zero-filling the extra with the part above. If it seems tricky, we can leave that until later.
client/blb/client.go
Outdated
done := false | ||
var res tractResultRepl | ||
for n := 0; n < nReaders; n++ { | ||
res = <-resultCh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just do res := <-resultCh
, right? No need to pre-declare it.
client/blb/client.go
Outdated
*result = res.tractResult | ||
copy(thisB, res.thisB) | ||
return | ||
copy(thisB[:len(res.thisB)], res.thisB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just copy(thisB, res.thisB)
. copy will only copy as much as there's room for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think these hooks for testing will work pretty well. It's possible you'll have to tweak the placement. You'll probably want to make a generic backupRequestFunc for testing. More comments on the code...
client/blb/client.go
Outdated
|
||
// vars for test injection | ||
var ( | ||
randOrder = getRandomOrder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can just write randOrder = rand.Perm
client/blb/client.go
Outdated
host string, | ||
reqID string, | ||
tract *core.TractInfo, | ||
thisB []byte, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird to pass in a slice without using it at all. I know we may have to pass in the slice later if we want to try removing the allocation/copy, but the code should be as clean as possible at this point in time, so we should pass just the length now, and change it back later if we need to.
client/blb/client.go
Outdated
thisOffset int64, | ||
order []int, | ||
n int, | ||
resultCh chan tractResultRepl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put a direction restriction on the chan type if you can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to put a direction restriction where the compiler would allow.
client/blb/client.go
Outdated
randOrder = getRandomOrder | ||
backupRequestFunc = doParallelBackupReads | ||
readDone = func() {} // call when read/readAt rpcs return. | ||
backupPhaseDone = func() {} // call when the entire backup read phase is done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are only used in tests, but it still feels better to put them in the Client rather than globals. Also maybe put "Hook" in the name so the semantics are more clear (that they're just hooks for injecting test synchronization).
client/blb/client_test.go
Outdated
return nil | ||
randOrder = func(n int) []int { | ||
order := make([]int, n) | ||
for i := 0; i < n; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can write for i := range order
client/blb/client_test.go
Outdated
// Does the first request succeed before backup is sent work? | ||
sendPrimary := make(chan bool) | ||
sendBackup := make(chan bool) | ||
backupRequestFunc = func( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be annoying to have to write a new implementation of this for each test.. I wonder if there's a generic version that can be reused? You can try just writing a few tests and see what commonalities fall out.
client/blb/client.go
Outdated
thisB []byte, | ||
thisOffset int64, | ||
order []int, | ||
n int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could pass in order[n] instead of both of them?
client/blb/client.go
Outdated
resultCh <- tractResultRepl{ | ||
tractResult: tractResult{0, 0, core.ErrCanceled, badVersionHost}, | ||
} | ||
case <-cli.backupReadState.delayFunc(time.Duration(n) * delay): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could pass in the delay value instead of n? or possibly even the timer channel itself? then you might not need delayFunc at all, since you're going to override the spawning logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking that we override spawning logic at the layer below by putting readOneTractWithResult
calls in different goroutines? Or if we get rid of the bch channel in the test overrides, we could pass in a time.After(0) to each and control execution by outer channels as I wrote in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the latter.. pass in 0 or time.After(0) for all the calls in your backupRequestFunc implementation and control things by the order that you spawn them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing the delay channel seemed to work better for testing. Went with that over time.After(0).
client/blb/client.go
Outdated
var badVersionHost string | ||
host := tract.Hosts[order[n]] | ||
if host == "" { | ||
log.V(1).Infof("read %s from tsid %d: no host", tract.Tract, tract.TSIDs[n]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tract.Hosts and tract.TSIDs are parallel, right? so using n here is wrong, it should be order[n]?
client/blb/client_test.go
Outdated
// read completes, the other two are cancelled. Perhaps we need to mock the cancel func | ||
// as well to disable cancellation? | ||
// release fake sleep | ||
sendPrimary <- true | ||
bch <- time.Time{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'm thinking you don't need delayFunc/bch after all
Addressed last batch of comments, and cleaned up the tests. |
cancelDone = func() {} | ||
// restoreTestFuncs restores the overrides created in setupBackupClient so other | ||
// tests not relevant to backups work as is. | ||
func (cli *Client) restoreTestFuncs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have a function like this. Tests should just make new Clients.
fail := func(e tsTraceEntry) core.Error { | ||
// Reads from the second tractserver fails. | ||
// Reads from the first tractserver fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I made a bunch of notes on small things that can be cleaned up, but nothing major.
After that, I think what I'll do is merge it into a branch in this repo (since it is technically broken without the reportbadts stuff) and work on restoring that, then we can merge it to master. You can work on cross-ts cancellation on that branch.
resultCh chan<- tractResultRepl, | ||
nOrder int) { | ||
err := core.ErrAllocHost // default error if none present | ||
var badVersionHost string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused here?
delayTimer <-chan time.Time, | ||
resultCh chan<- tractResultRepl, | ||
nOrder int) { | ||
err := core.ErrAllocHost // default error if none present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only used on one place below, you can just inline it
@@ -1111,46 +1241,70 @@ func (cli *Client) readOneTractReplicated( | |||
thisB []byte, | |||
thisOffset int64) { | |||
|
|||
// TODO(eric): remove this var when we redo bad ts version reporting | |||
var badVersionHost string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused?
// newClient creates a Client suitable for testing. The trace function given | ||
// should return core.NoError for a read/write to proceed, and something else to | ||
// inject an error. | ||
// inject an error. The disableBackupReads parameter allows the backup read feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment change should be reverted
checkWrite(t, blob, p1) | ||
} | ||
|
||
// testRead tests a single read at a given length and offset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't use len (should be length) or off, or done.
// setupBackupClient initializes a client that has backup requests enabled. | ||
// it also overrides synchronization hooks on the client that allow for | ||
// contol of read behavior operation ordering. | ||
func (cli *Client) setupBackupClient(maxNumBackups int, overrideDelay bool) (<-chan bool, <-chan bool, <-chan bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overrideDelay is never false, so it probably shouldn't exist. It seems like this isn't really useful without it, no?
// setupBackupRequestFunc overrides spawning logic for backup reads, it will spawn nReaders | ||
// goroutines that will block on respective delayChans. To release a reader one can write to the | ||
// delayChan. | ||
func setupBackupRequestFunc(cli *Client, nReaders int) []chan time.Time { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this is always used in conjunction with setupBackupClient, so they should probably be one function
<-cancelCh | ||
<-bdone | ||
<-readDone | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this (and all the following tests) check the trace log too? how do they tell that the right thing is happening?
testWrite(t, blob, core.TractLength, core.TractLength) | ||
|
||
// Test one request per host, first request finishes | ||
rdone, bdone, _ := cli.setupBackupClient(2, true) // 2 backup requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 here means 1 backup request, right?
<-rdone | ||
<-readDone | ||
|
||
// Do we fallback if the backup request returns an error before the first one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are there two tests in this function? shouldn't they be isolated?
Thanks! Will get to this soon. |
This is an implementation of client backup requests for replicated storage. The set of primary and backup requests are now marked with a new application-level requestID, a constant across the group. Additionally, the set of "other" backup hosts that may be working on a given read is also added to rpc requests so when it comes time to cancel work, a tractserver knows which host(s) to send cancellation rpcs to. The work for tractserver-to-tractserver cancellation still needs to done using this new request ID in a later PR.
NOTE: error reporting code was removed in readOneTractReplicated and readOneTractRS. Added TODOs for when we redo this logic.