Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: remove acceptReady from RawNode #130307

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Sep 8, 2024

This PR removes the possibility of getting Ready from RawNode and not accepting it. This API existed to serve concurrency concerns in the implementation of Node interface, which was removed in #131746.

Removing the "accept" phase from RawNode allows further cleanups in the supporting data structures (raftLog / unstable / etc) down the stack.

Conceptually, we do want separation between "get ready" and "accept ready", but in a different form:

  • The "get ready" should return general information about the outstanding updates. In the future, this role should be taken by the HasReady() method or its variants (specific to a part of RawNode, like the log storage). Today it only returns one "overall" bit.
  • The "accept ready" should be more granular / controllable in terms of how much work the app is willing to commit to. We are gradually moving in this direction, with methods like LogSnapshot() and SendMsgApp(). Today, the Ready() still returns all the work eagerly.

Related to #122131

@pav-kv pav-kv requested a review from a team as a code owner September 8, 2024 20:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv pav-kv changed the title raft: remove accept Ready API raft: remove acceptReady API from RawNode Sep 8, 2024
@pav-kv pav-kv changed the title raft: remove acceptReady API from RawNode raft: remove acceptReady from RawNode Sep 8, 2024
@pav-kv pav-kv force-pushed the rm-accept-ready branch 2 times, most recently from 7a2650e to 71eb886 Compare September 8, 2024 21:13
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clean up!
The second commit is a little difficult to review - a lot moved and changed at the same time. It would likely have been helpful to split it up: one commit that moves acceptReady right underneath readyWithoutAccept, then a second commit that merges them with only minimal changes, then a third commit that does the simplifications that seem to have happened.
I've reviewed the new code but without checking that it's functionally identical to the previous one.

pkg/raft/node.go Outdated Show resolved Hide resolved
pkg/raft/node.go Outdated Show resolved Hide resolved
pkg/raft/node.go Outdated Show resolved Hide resolved
@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 9, 2024

@tbg I split the last commit into 3 as you suggested.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 9, 2024

TestCommitPaginationWithAsyncStorageWrites fails on CI, though I wasn't able to repro with 20k stress runs. Don't know if it's related to the change (this should be a no-op), will audit it again.

pkg/raft/example_test.go Outdated Show resolved Hide resolved
}
}
r.msgs = nil
r.msgsAfterAppend = nil
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to double-check that this field is handled equivalently. Previously it was cleared between readyWithoutAccept and acceptReady, now (in the merged code) it is cleared at the end.

@tbg
Copy link
Member

tbg commented Sep 9, 2024

TestCommitPaginationWithAsyncStorageWrites fails on CI, though I wasn't able to repro with 20k stress runs. Don't know if it's related to the change (this should be a no-op), will audit it again.

Hmm, seems too suspiciously related. I'll also take a look tmrw.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - approving from a review point of view after your pending comments are addressed.
The test failure is suspicious.

@tbg
Copy link
Member

tbg commented Sep 10, 2024

Actually, pattern matching suggests that a flake like this makes some sense. Previously node would "grow" the readies more than now (where each ready is accepted). I think the diff had a comment about "this making tests more stable". Without having looked at the details, it's not too surprising that a change in node timings/internal semantics causes a flake in a node test.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 10, 2024

The test is inherently flaky. In this bit

// Third entry should not be returned to be applied until first entry's
// application is acknowledged.
drain := true
for drain {
select {
case rd := <-n.Ready():
for _, m := range rd.Messages {
require.NotEqual(t, raftpb.MsgStorageApply, m.Type, "unexpected message: %v", m)
}
case <-time.After(10 * time.Millisecond):
drain = false
}
}
// Acknowledged first entry application.
require.NoError(t, n.Step(ctx, applyResps[0]))
applyResps = applyResps[1:]
// Third entry now returned for application.
rd = readyWithTimeout(n)
require.Len(t, rd.Messages, 1)
m = rd.Messages[0]
require.Equal(t, raftpb.MsgStorageApply, m.Type)
require.Len(t, m.Entries, 1)
applyResps = append(applyResps, m.Responses[0])

Normally, the "drain" loop will see at least one Ready which moves the commit index to 4 (with a MsgStorageAppend). Then we expect that the next Ready returns the last entry 4 for application. So the expected Ready after the "drain" is:

raft2024/09/10 10:42:15 INFO: emitted ready: Ready MustSync=false:
CommittedEntries:
1/4 EntryNormal "aaa"
Messages:
1->ApplyThread MsgStorageApply Term:0 Log:0/0 Entries:[1/4 EntryNormal "aaa"] Responses:[
  ApplyThread->1 MsgStorageApplyResp Term:0 Log:0/0 Entries:[1/4 EntryNormal "aaa"]
]

but we see

raft2024/09/09 12:11:43 INFO: emitted ready: Ready MustSync=true:
HardState Term:1 Vote:1 Commit:4 Lead:1 LeadEpoch:1
CommittedEntries:
1/4 EntryNormal "aaa"
Messages:
1->AppendThread MsgStorageAppend Term:1 Log:0/0 Commit:4 Vote:1 Lead:1 LeadEpoch:1
1->ApplyThread MsgStorageApply Term:0 Log:0/0 Entries:[1/4 EntryNormal "aaa"] Responses:[
  ApplyThread->1 MsgStorageApplyResp Term:0 Log:0/0 Entries:[1/4 EntryNormal "aaa"]
]

That can happen if the "drain" loop gives up too fast (it allows only 10ms). I changed the timeout to 1 microsecond locally, and it reproes the thing (both on master and with this PR). I'll fix this flake in another PR, have an idea.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 10, 2024

Deflaked the test in this PR, for convenience.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 2, 2024

Will sequence this PR after #131746, so that tweaking the Node API is not necessary. Could have merged this one first, but neither is urgent.

Epic: none
Release note: none
Epic: none
Release note: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants