-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: document raft Storage mental model #131041
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
@tbg @nvanbenschoten I believe most of the text is true, but there might be gaps. Does this match your understanding of how |
// conventions, due to being an implementation of raft.Storage interface from a | ||
// different package. | ||
// | ||
// TODO(pav-kv): audit all the methods and figure out if the mu/raftMu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at Entries / TypedEntries
call, for example, they use fields from both r.raftMu
and r.mu
, which suggests both must be held. So this seems like a bug because Entries
can be called from any Step
, and there are a few Step
s that we do under mu
only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we might need to do 2 implementations of this interface: one for raftMu
held (only, or together with mu
), and one for mu
held (only, or with raftMu
).
That would be at least one difference between the "regular" LogStorage
interface and the "snapshot" one that we're exploring in #130967.
// which is an un-contended "IO" mutex and is allowed to be held longer. Most | ||
// writes are extracted from RawNode while holding r.mu (in the Ready() loop), | ||
// and handed over to storage under r.raftMu. There are a few cases when CRDB | ||
// synthesizes the writes (e.g. during a range split / merge) under r.raftMu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check whether the synthesized writes hold only raftMu
or both
// which is an un-contended "IO" mutex and is allowed to be held longer. Most | ||
// writes are extracted from RawNode while holding r.mu (in the Ready() loop), | ||
// and handed over to storage under r.raftMu. There are a few cases when CRDB | ||
// synthesizes the writes (e.g. during a range split / merge) under r.raftMu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also log truncations are out of RawNode
's control (don't remember details, might be different between "decoupled" or regular truncations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Truncations happen under raftMu
(either from command application stack, or the "decoupled" worker). They then update the log storage state (e.g. truncated state) under Replica.mu
after the batch is done (in both cases). Because of this ordering, if we lock mu
during truncation but before truncated state is updated, there is an in-flight write (deletion), i.e. guarantee (1) is not true; so Entries
calls need to synchronize with these writes.
Might be better to update the in-memory TruncatedState
first instead, and only then write the batch. Then the Replica.mu
-only readers will sync with it well, (1) will hold, and so they will be "snapshots" too.
But today, we can only achieve "snapshots" by holding raftMu
when dealing with log storage entries (FirstIndex
and Entries
calls). For mu
only, it should be acceptable that the log is truncated under its nose. I.e. when Storage.FirstIndex()
returns 10, we don't have a strict guarantee that the first index is indeed 10, it can grow. Need to reflect this in Storage
comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The truncated state update race is easily fixable though: just need to move it from post-apply to pre-apply handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// (1) There is a guarantee that everything RawNode reads from Storage has no | ||
// in-flight writes. Raft always reads state that it knows to be stable (meaning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correction: no pending "append" type writes, like ones initiated by RawNode
. There can be deletions/truncations though. We should be fine with that.
Alternatively, it would be nice to coordinate truncations through RawNode
too. First register an intent to truncate with RawNode
, and only then enact. Then the RawNode
wouldn't even need to have errors like ErrCompacted
and ErrUnavailable
in the Storage
calls - it would always know a valid range of log indices. FirstIndex/LastIndex
wouldn't even need to be an interface.
There is a related / slightly orthogonal TODO about synchronizing lazy truncations with Ready
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)
pkg/kv/kvserver/replica_raftstorage.go
line 65 at r1 (raw file):
These methods each contain the following comment:
requires that r.mu is held for writing because it requires exclusive access to r.mu.stateLoader
So they do require r.mu.Lock
to be held for writing.
pkg/kv/kvserver/replica_raftstorage.go
line 78 at r1 (raw file):
// whether the appended log slice is consistent with raft rules. // // (1) There is a guarantee that everything RawNode reads from Storage has no
We should clarify who is the one making this guarantee. RawNode may read from stable Storage while there are in-flight writes, but it will not read the value of those writes.
pkg/kv/kvserver/replica_raftstorage.go
line 103 at r1 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
TODO: explain why we need this "snapshotting" in the first place. We need it to apply updates in
Ready()
in a non-blocking manner, or doing other stuff on behalf of theRawNode
(like sendingMsgApp
s) that incurs IO.
While doing so, it would be helpful to explain the lifetime of these snapshots and how we ensure that the r.raftMu is held across that lifetime.
pkg/kv/kvserver/replica_raftstorage.go
line 104 at r1 (raw file):
methods assume that
We could also have them assert that, using syncutil.Mutex.AssertHeld
(and the corresponding RWMutex methods).
pkg/kv/kvserver/replica_raftstorage.go
line 111 at r1 (raw file):
and there are a few
Step
s that we do undermu
only
Will these specific Step
s ever access Storage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raftstorage.go
line 111 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
and there are a few
Step
s that we do undermu
onlyWill these specific
Step
s ever accessStorage
?
A couple ones that I found recently:
- A
MsgAppResp
step in the snapshot sending code. AMsgAppResp
handler commonly tries to send follow-upMsgApp
s. In the snapshot sending scenario, I imagine it is common to follow-up with with the log entries after the snapshot. And they can be already in stable storage. - The
Campaign
method is sometimes called withmu
only. The stack then goes to here, where we scan all unapplied log entries. They can be stable too.
Epic: none Release note: none
9f80b75
to
05f73ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raftstorage.go
line 65 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
These methods each contain the following comment:
requires that r.mu is held for writing because it requires exclusive access to r.mu.stateLoader
So they do require
r.mu.Lock
to be held for writing.
Made a not about this in the head comment.
pkg/kv/kvserver/replica_raftstorage.go
line 78 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should clarify who is the one making this guarantee. RawNode may read from stable Storage while there are in-flight writes, but it will not read the value of those writes.
Done. The RawNode
guarantees that it won't read anything that has in-flight writes.
pkg/kv/kvserver/replica_raftstorage.go
line 104 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
methods assume that
We could also have them assert that, using
syncutil.Mutex.AssertHeld
(and the corresponding RWMutex methods).
Done (see a few added commits).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, 5 of 5 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)
pkg/kv/kvserver/replica_raft.go
line 2856 at r3 (raw file):
return 0, err } var totalSideloaded int64
nit: we no longer need this separate declaration.
Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
05f73ec
to
30c04cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go
line 2856 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: we no longer need this separate declaration.
Done
bors r=nvanbenschoten |
This PR documents the use of
Replica.{raftMu,mu}
mutexes inraft.{RawNode,Storage}
, and fixes a few minor things along the way.Part of #130955