Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: document raft Storage mental model #131041

Merged
merged 5 commits into from
Sep 24, 2024

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Sep 19, 2024

This PR documents the use of Replica.{raftMu,mu} mutexes in raft.{RawNode,Storage}, and fixes a few minor things along the way.

Part of #130955

Copy link

blathers-crl bot commented Sep 19, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@pav-kv pav-kv requested a review from tbg September 19, 2024 18:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 19, 2024

@tbg @nvanbenschoten I believe most of the text is true, but there might be gaps. Does this match your understanding of how mu/raftMu split works w.r.t. RawNode and Storage?

// conventions, due to being an implementation of raft.Storage interface from a
// different package.
//
// TODO(pav-kv): audit all the methods and figure out if the mu/raftMu
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at Entries / TypedEntries call, for example, they use fields from both r.raftMu and r.mu, which suggests both must be held. So this seems like a bug because Entries can be called from any Step, and there are a few Steps that we do under mu only.

Copy link
Collaborator Author

@pav-kv pav-kv Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we might need to do 2 implementations of this interface: one for raftMu held (only, or together with mu), and one for mu held (only, or with raftMu).

That would be at least one difference between the "regular" LogStorage interface and the "snapshot" one that we're exploring in #130967.

pkg/kv/kvserver/replica_raftstorage.go Outdated Show resolved Hide resolved
// which is an un-contended "IO" mutex and is allowed to be held longer. Most
// writes are extracted from RawNode while holding r.mu (in the Ready() loop),
// and handed over to storage under r.raftMu. There are a few cases when CRDB
// synthesizes the writes (e.g. during a range split / merge) under r.raftMu.
Copy link
Collaborator Author

@pav-kv pav-kv Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check whether the synthesized writes hold only raftMu or both

// which is an un-contended "IO" mutex and is allowed to be held longer. Most
// writes are extracted from RawNode while holding r.mu (in the Ready() loop),
// and handed over to storage under r.raftMu. There are a few cases when CRDB
// synthesizes the writes (e.g. during a range split / merge) under r.raftMu.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also log truncations are out of RawNode's control (don't remember details, might be different between "decoupled" or regular truncations).

Copy link
Collaborator Author

@pav-kv pav-kv Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truncations happen under raftMu (either from command application stack, or the "decoupled" worker). They then update the log storage state (e.g. truncated state) under Replica.mu after the batch is done (in both cases). Because of this ordering, if we lock mu during truncation but before truncated state is updated, there is an in-flight write (deletion), i.e. guarantee (1) is not true; so Entries calls need to synchronize with these writes.

Might be better to update the in-memory TruncatedState first instead, and only then write the batch. Then the Replica.mu-only readers will sync with it well, (1) will hold, and so they will be "snapshots" too.

But today, we can only achieve "snapshots" by holding raftMu when dealing with log storage entries (FirstIndex and Entries calls). For mu only, it should be acceptable that the log is truncated under its nose. I.e. when Storage.FirstIndex() returns 10, we don't have a strict guarantee that the first index is indeed 10, it can grow. Need to reflect this in Storage comments.

Copy link
Collaborator Author

@pav-kv pav-kv Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The truncated state update race is easily fixable though: just need to move it from post-apply to pre-apply handler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 78 to 79
// (1) There is a guarantee that everything RawNode reads from Storage has no
// in-flight writes. Raft always reads state that it knows to be stable (meaning
Copy link
Collaborator Author

@pav-kv pav-kv Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correction: no pending "append" type writes, like ones initiated by RawNode. There can be deletions/truncations though. We should be fine with that.

Alternatively, it would be nice to coordinate truncations through RawNode too. First register an intent to truncate with RawNode, and only then enact. Then the RawNode wouldn't even need to have errors like ErrCompacted and ErrUnavailable in the Storage calls - it would always know a valid range of log indices. FirstIndex/LastIndex wouldn't even need to be an interface.

There is a related / slightly orthogonal TODO about synchronizing lazy truncations with Ready.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)


pkg/kv/kvserver/replica_raftstorage.go line 65 at r1 (raw file):
These methods each contain the following comment:

requires that r.mu is held for writing because it requires exclusive access to r.mu.stateLoader

So they do require r.mu.Lock to be held for writing.


pkg/kv/kvserver/replica_raftstorage.go line 78 at r1 (raw file):

// whether the appended log slice is consistent with raft rules.
//
// (1) There is a guarantee that everything RawNode reads from Storage has no

We should clarify who is the one making this guarantee. RawNode may read from stable Storage while there are in-flight writes, but it will not read the value of those writes.


pkg/kv/kvserver/replica_raftstorage.go line 103 at r1 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

TODO: explain why we need this "snapshotting" in the first place. We need it to apply updates in Ready() in a non-blocking manner, or doing other stuff on behalf of the RawNode (like sending MsgApps) that incurs IO.

While doing so, it would be helpful to explain the lifetime of these snapshots and how we ensure that the r.raftMu is held across that lifetime.


pkg/kv/kvserver/replica_raftstorage.go line 104 at r1 (raw file):

methods assume that

We could also have them assert that, using syncutil.Mutex.AssertHeld (and the corresponding RWMutex methods).


pkg/kv/kvserver/replica_raftstorage.go line 111 at r1 (raw file):

and there are a few Steps that we do under mu only

Will these specific Steps ever access Storage?

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raftstorage.go line 111 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

and there are a few Steps that we do under mu only

Will these specific Steps ever access Storage?

A couple ones that I found recently:

  • A MsgAppResp step in the snapshot sending code. A MsgAppResp handler commonly tries to send follow-up MsgApps. In the snapshot sending scenario, I imagine it is common to follow-up with with the log entries after the snapshot. And they can be already in stable storage.
  • The Campaign method is sometimes called with mu only. The stack then goes to here, where we scan all unapplied log entries. They can be stable too.

Epic: none
Release note: none
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raftstorage.go line 65 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

These methods each contain the following comment:

requires that r.mu is held for writing because it requires exclusive access to r.mu.stateLoader

So they do require r.mu.Lock to be held for writing.

Made a not about this in the head comment.


pkg/kv/kvserver/replica_raftstorage.go line 78 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should clarify who is the one making this guarantee. RawNode may read from stable Storage while there are in-flight writes, but it will not read the value of those writes.

Done. The RawNode guarantees that it won't read anything that has in-flight writes.


pkg/kv/kvserver/replica_raftstorage.go line 104 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

methods assume that

We could also have them assert that, using syncutil.Mutex.AssertHeld (and the corresponding RWMutex methods).

Done (see a few added commits).

@pav-kv pav-kv marked this pull request as ready for review September 23, 2024 11:23
@pav-kv pav-kv requested a review from a team as a code owner September 23, 2024 11:23
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 1 of 1 files at r2, 5 of 5 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)


pkg/kv/kvserver/replica_raft.go line 2856 at r3 (raw file):

		return 0, err
	}
	var totalSideloaded int64

nit: we no longer need this separate declaration.

Epic: none
Release note: none
Epic: none
Release note: none
Epic: none
Release note: none
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go line 2856 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: we no longer need this separate declaration.

Done

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 24, 2024

bors r=nvanbenschoten

@craig craig bot merged commit d26f61d into cockroachdb:master Sep 24, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the raft-storage-mental-model branch September 24, 2024 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants