Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: refortify followers if they are not fortified #130628

Merged
merged 2 commits into from
Sep 27, 2024

Conversation

iskettaneh
Copy link
Contributor

@iskettaneh iskettaneh commented Sep 13, 2024

This commit does the following:

  1. The leader now skips sending fortification messages to followers whose stores don't provide support in the store-liveness-fabric.
  2. The leader now followers that needs fortification. It keeps checking on every heartbeat timeout.

Fixes: #125349

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

	}

	for i := 0; i < storeLivenessNodes; i++ {

This is ugly, but we sometimes add a number of nodes that are less than the number of voters for example. If we don't add the extra store-liveness nodes here, we trip into an out-of-bounds error in some tests.


pkg/raft/tracker/supporttracker.go line 67 at r1 (raw file):

// If we haven't tracked the follower's support epoch, it will return an epoch
// of 0 and the boolean will be false.
func (st *SupportTracker) GetFollowerSupportEpoch(id pb.PeerID) (pb.Epoch, bool) {

@arulajmani Do we usually have this pattern where the variable is not exported, but we export a function that gets it?

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't gone through the datadriven test file changes yet; first pass.

Reviewed 1 of 13 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @nvanbenschoten)


-- commits line 6 at r2:
See my comment below, but this change deserves to be in its own commit.


-- commits line 11 at r2:
nit: if you're referencing a Github issue you don't need this "Epic" field.


pkg/raft/raft.go line 773 at r2 (raw file):

// (including the leader itself) who don't support the leader at the current
// epoch. We skip sending a fortification message if we suspect a problem with
// the connection to the follower.

nit: instead of wording this in terms of "connection", let's reference store liveness in this comment instead.


pkg/raft/raft.go line 783 at r2 (raw file):

	r.trk.Visit(func(id pb.PeerID, _ *tracker.Progress) {
		curEpoch, _, ok := r.storeLiveness.SupportFrom(id)

Consider pulling this logic into a new maybeSendFortify method instead.


pkg/raft/raft.go line 1861 at r2 (raw file):

			} else {
				r.becomeLeader()
				r.maybeBcastFortify()

We're changing existing behaviour here, so we should pull this (and the affected datadriven tests) into a different commit.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

Previously, iskettaneh wrote…

This is ugly, but we sometimes add a number of nodes that are less than the number of voters for example. If we don't add the extra store-liveness nodes here, we trip into an out-of-bounds error in some tests.

Can this be avoided if we say storeLivenessNodes = num-voters + num-learners?


pkg/raft/tracker/supporttracker.go line 67 at r1 (raw file):

Previously, iskettaneh wrote…

@arulajmani Do we usually have this pattern where the variable is not exported, but we export a function that gets it?

Yeah, this pattern looks good. How do you feel about renaming this to IsSuppotedBy instead?


pkg/raft/raft_paper_test.go line 125 at r2 (raw file):

			if storeLivenessEnabled {
				// Mock followers' support.
				r.supportTracker.RecordSupport(pb.PeerID(1), 1)

Instead of mocking this out like this, should we instead change the assertion condition below in the storeLivenessEnabled case? That way, we're testing your change here.

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 6 at r2:

Previously, arulajmani (Arul Ajmani) wrote…

See my comment below, but this change deserves to be in its own commit.

Done.


pkg/raft/raft.go line 783 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Consider pulling this logic into a new maybeSendFortify method instead.

Done.


pkg/raft/raft.go line 1861 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

We're changing existing behaviour here, so we should pull this (and the affected datadriven tests) into a different commit.

Done.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Can this be avoided if we say storeLivenessNodes = num-voters + num-learners?

I thought of that, but I don't think it works because we actually have multiple add-nodes statements, and we keep the number of voters the same in them. Something like:

add-nodes 1 voters=(1 2 3)
...
add-nodes 2 voters=(1 2 3)

So the second statement would create another 3 store liveness nodes, which we don't want


pkg/raft/raft_paper_test.go line 125 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Instead of mocking this out like this, should we instead change the assertion condition below in the storeLivenessEnabled case? That way, we're testing your change here.

Done.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 13 files at r1, 11 of 11 files at r3, 8 of 8 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)


-- commits line 4 at r3:
s/fosending/sending/


pkg/raft/raft.go line 727 at r3 (raw file):

		return
	}
	curEpoch, _, ok := r.storeLiveness.SupportFrom(id)

This isn't checking that the follower is providing active support to the leader's store, just that it at some point was providing support. We need to cross reference the support expiration (the discarded variable) with the leader's clock. To do that, we'll need to use storeLiveness.SupportExpired.


pkg/raft/raft.go line 729 at r3 (raw file):

we don't know the follower's store liveness epoch

"the follower's support liveness epoch" isn't a well defined term, and even if it were, it probably would mean something different from what we want it to mean here.

How about something along the lines of: If the follower isn't providing active store liveness support to the leader, or it is but the leader isn't hearing about it, ...


pkg/raft/raft.go line 734 at r3 (raw file):

		// message once the connection is re-established.
		if id == r.id {
			// Log if the leader doesn't support itself in the liveness fabric.

Consider mentioning that this is possible if the leader's disk stalls.


pkg/raft/raft.go line 803 at r3 (raw file):

// maybeBcastFortify attempts to send an RPC to fortify the leader to all the
// peers (including the leader itself) who don't support the leader at the
// current epoch.

"the current epoch"

The leader actually has a different store liveness epoch for each remove store. Can we change this to "all the peers (including the leader itself) whose stores are currently providing store liveness support to the leader's store but who have not provided fortification support to the raft leader at the corresponding store liveness epoch." Or something along those lines.


pkg/raft/raft.go line 805 at r3 (raw file):

// current epoch.
func (r *raft) maybeBcastFortify() {
	if !r.storeLiveness.SupportFromEnabled() {

We're now checking this at three different levels. Is there a good way to clean that up? Should it only live in maybeSendFortify?


pkg/raft/raft.go line 1032 at r4 (raw file):

		}

		// Try to refortify any followers that don't currently support us.

Note to self / @arulajmani / @iskettaneh, as we start adding more logic around raft that's conditional on whether the fortification protocol is in use (defined by storeLiveness.SupportFromEnabled) we'll want to make sure to be consistent with which layer of the raft code those checks are being made. Redundant checks aren't the end of the world, and they can even make sense in some cases, but this code is still going to be tricky to reason about it we have the branching all over the place (e.g. in tick functions and also in message bcast functions and also in message send functions and also in the SupportTracker).

Another note to the group, we should probably extract a (*raft).fortificationEnabled() bool method or something that maps the decision of store liveness being enabled to the decision of the raft fortification protocol being used. That might be a simple mapping, but it's better to be explicit about that in a single location than letting the mapping leak throughout this package.


pkg/raft/raft_paper_test.go line 118 at r4 (raw file):

			} else {
				r = newTestRaft(1, 10, hi,
					newTestMemoryStorage(withPeers(1, 2, 3)), withFortificationDisabled())

nit: any way to share more code between these two branches, to avoid divergence? Can we have the conditional logic just construct a set of []testConfigModifierOpt?

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this, this is pretty close. Mostly minor comments at this point.

Reviewed 9 of 9 files at r6, 4 of 9 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh and @nvanbenschoten)


-- commits line 2 at r6:
nit: "do not try to fortify followers which are not supported in store liveness"


-- commits line 5 at r6:
The indentation here seems off. The commit message is also missing the usual release note/issue template bits.


-- commits line 9 at r7:
"periodically attempt to fortify unfortified followers" is probably more accurate.


-- commits line 11 at r7:
nit: the first two sentences here feel a bit light on detail and don't flow. They're fine for us, given we have context, but they might not serve future readers well. Could you expand on these please?

One suggestion I'd have (from our commit guidelines wiki) is to follow the "what was here before" and "what you changed" pattern.


-- commits line 12 at r7:

Moreover,
the leader now skips sending fortification messages to followers whose
stores don't provide support in the store-liveness-fabric.

Is this still accurate given the prior commit?


pkg/raft/raft.go line 805 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We're now checking this at three different levels. Is there a good way to clean that up? Should it only live in maybeSendFortify?

Resolved.


pkg/raft/raft.go line 1032 at r4 (raw file):

Another note to the group, we should probably extract a (*raft).fortificationEnabled() bool method or something that maps the decision of store liveness being enabled to the decision of the raft fortification protocol being used. That might be a simple mapping, but it's better to be explicit about that in a single location than letting the mapping leak throughout this package.

This makes sense to me. It also allows us to have a single check (as we do right now, in the latest iteration, in maybeSendFortify) and replace all redundant checks with assertions instead (if we want to).


pkg/raft/raft.go line 706 at r6 (raw file):

// maybeSendFortify sends a fortification RPC to the given peer if it doesn't
// support the leader. We skip sending a fortification message if we suspect a

"if it isn't fortified but the peer's store supports the leader's store in StoreLiveness".


pkg/raft/raft.go line 719 at r6 (raw file):

		// leader, or it is but the leader isn't hearing about it, we don't need to
		// send a fortify message. This is because it's likely that there is a
		// problem with the connection, and we will attempt to send the fortify

It's also possible that the follower is experiencing a disk stall. Consider removing any mention of the network here; instead, you could just hand wave by saying "we will attempt to fortify the follower once store liveness support is established".


pkg/raft/raft.go line 1009 at r7 (raw file):

	if r.heartbeatElapsed >= r.heartbeatTimeout {
		r.heartbeatElapsed = 0
		if err := r.Step(pb.Message{From: r.id, Type: pb.MsgBeat}); err != nil {

If I understand correctly, the end goal is for this to only happen if fortification is disabled, right?

If so, sounds like we'll need either a SupportFromEnabled or (*raft).fortificationEnabled check here as well.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

Previously, iskettaneh wrote…

I thought of that, but I don't think it works because we actually have multiple add-nodes statements, and we keep the number of voters the same in them. Something like:

add-nodes 1 voters=(1 2 3)
...
add-nodes 2 voters=(1 2 3)

So the second statement would create another 3 store liveness nodes, which we don't want

Is there a way to detect this case as well? For example, if the ID we're assigning in AddNodes is less than the number of nodes in StoreLiveness, we know that storeLivenessNodes = 0.

It's not the end of the world to add this, but it would be nice to avoid it if at all possible.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 40 at r6 (raw file):

	// The number of store liveness nodes is the same number of nodes by default.
	// However, it could be explicitly specified using `store-liveness-nodes`

missing period.


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

// IsSuppotedBy returns true if a follower is supporting the leader in the
// provided store liveness epoch.
func (st *SupportTracker) IsSuppotedBy(id pb.PeerID, currentEpoch pb.Epoch) bool {

s/currentEpoch/epoch/g

Separately, do we need to pass the epoch into this function? We already have a handle to store liveness on the SupportTracker -- should we just use that instead?


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

// IsSuppotedBy returns true if a follower is supporting the leader in the
// provided store liveness epoch.
func (st *SupportTracker) IsSuppotedBy(id pb.PeerID, currentEpoch pb.Epoch) bool {

Heh, I think I made a typo in my initial comment that you carried over here -- I should've said "IsSupportedBy"


pkg/raft/testdata/fortification_basic.txt line 124 at r6 (raw file):

  1/3 EntryNormal ""
  Messages:
  3->1 MsgFortifyLeaderResp Term:1 Log:0/0 Rejected (Hint: 0)

With this change, we're losing test coverage of the case where a follower rejects a MsgFortify request because it isn't supporting the leader's store in store liveness. That's because there's no way to express uni-directional support withdrawal in the data-driven DSL.

The scenario where the leader thinks it has support from a follower and sends it a MsgFortify request, but the follower rejects it because it no longer supports the leader in StoreLiveness, should be a rare race in practice. We should still test it though -- maybe a regular test instead of trying to express it using this DSL?


pkg/raft/tracker/supporttracker_test.go line 180 at r6 (raw file):

			storeLiveness: mockLivenessOnePeer,
			setup: func(supportTracker *SupportTracker) {
				// Support recorded for a different follower than the one in the

s/in the/in/


pkg/raft/tracker/supporttracker_test.go line 199 at r6 (raw file):

			storeLiveness: mockLivenessOnePeer,
			setup: func(supportTracker *SupportTracker) {
				// Record support at the same epoch as the storeLiveness.

Should we also add a test case for a higher epoch than store liveness? We should also detail how that can happen and how we're choosing to handle it in IsSupportedBy (see the comment in LeadSupportUntil which describes something similar.

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review! I addressed the comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 5 at r6:

Previously, arulajmani (Arul Ajmani) wrote…

The indentation here seems off. The commit message is also missing the usual release note/issue template bits.

Done.


-- commits line 9 at r7:

Previously, arulajmani (Arul Ajmani) wrote…

"periodically attempt to fortify unfortified followers" is probably more accurate.

Done.


-- commits line 12 at r7:

Previously, arulajmani (Arul Ajmani) wrote…

Moreover,
the leader now skips sending fortification messages to followers whose
stores don't provide support in the store-liveness-fabric.

Is this still accurate given the prior commit?

Done.


pkg/raft/raft.go line 727 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This isn't checking that the follower is providing active support to the leader's store, just that it at some point was providing support. We need to cross reference the support expiration (the discarded variable) with the leader's clock. To do that, we'll need to use storeLiveness.SupportExpired.

Oh great catch! I was under the impression that if the expiration time has already expired, the boolean would be false. It's good to know that this isn't the case.
I fixed it, and I will also add a test to verify that we do refortify if that happened.


pkg/raft/raft.go line 1032 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Another note to the group, we should probably extract a (*raft).fortificationEnabled() bool method or something that maps the decision of store liveness being enabled to the decision of the raft fortification protocol being used. That might be a simple mapping, but it's better to be explicit about that in a single location than letting the mapping leak throughout this package.

This makes sense to me. It also allows us to have a single check (as we do right now, in the latest iteration, in maybeSendFortify) and replace all redundant checks with assertions instead (if we want to).

Ack.


pkg/raft/raft.go line 706 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

"if it isn't fortified but the peer's store supports the leader's store in StoreLiveness".

Done.


pkg/raft/raft.go line 719 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

It's also possible that the follower is experiencing a disk stall. Consider removing any mention of the network here; instead, you could just hand wave by saying "we will attempt to fortify the follower once store liveness support is established".

Done.


pkg/raft/raft.go line 1009 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

If I understand correctly, the end goal is for this to only happen if fortification is disabled, right?

If so, sounds like we'll need either a SupportFromEnabled or (*raft).fortificationEnabled check here as well.

That's the end goal yes. However, we can't. reach that goal just yet until we remove all heartbeat dependencies. This should happen very soon though.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Is there a way to detect this case as well? For example, if the ID we're assigning in AddNodes is less than the number of nodes in StoreLiveness, we know that storeLivenessNodes = 0.

It's not the end of the world to add this, but it would be nice to avoid it if at all possible.

I managed to remove the explicit store nodes argument


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 40 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

missing period.

Done.


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/currentEpoch/epoch/g

Separately, do we need to pass the epoch into this function? We already have a handle to store liveness on the SupportTracker -- should we just use that instead?

Done the renaming.

I wanted to avoid having the supporTracker call another SupportFrom(). But I can change that if you think its important.


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Heh, I think I made a typo in my initial comment that you carried over here -- I should've said "IsSupportedBy"

Oh yeah it made it all the way through. I just changed that.


pkg/raft/testdata/fortification_basic.txt line 124 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

With this change, we're losing test coverage of the case where a follower rejects a MsgFortify request because it isn't supporting the leader's store in store liveness. That's because there's no way to express uni-directional support withdrawal in the data-driven DSL.

The scenario where the leader thinks it has support from a follower and sends it a MsgFortify request, but the follower rejects it because it no longer supports the leader in StoreLiveness, should be a rare race in practice. We should still test it though -- maybe a regular test instead of trying to express it using this DSL?

Good point. I added a fourth node to exercise all basic code paths: (1) The follower accepts fortification, (2) The follower rejects fortification, (3) The leader doesn't attempt fortification. Let me know if that complicated the test too much in your opinion.


pkg/raft/tracker/supporttracker_test.go line 180 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/in the/in/

Done.


pkg/raft/tracker/supporttracker_test.go line 199 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we also add a test case for a higher epoch than store liveness? We should also detail how that can happen and how we're choosing to handle it in IsSupportedBy (see the comment in LeadSupportUntil which describes something similar.

Done.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 9 files at r6, 13 of 13 files at r8, 2 of 9 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)


pkg/raft/raft.go line 713 at r8 (raw file):

		return
	}
	curEpoch, _, ok := r.storeLiveness.SupportFrom(id)

This is another case where we are assuming that ok means that support is currently provided. @miraradeva is switching that in #131113, so we could just wait for that instead of making a change.


pkg/raft/raft.go line 729 at r8 (raw file):

	}

	if !r.supportTracker.IsSupportedBy(id, curEpoch) {

I've found myself getting tripped up by the different meanings of "support" enough times that I think we should probably try to clean this up. We use "support" in all of the following ways:

  • store liveness support for an epoch
  • store liveness support for an epoch that is current
  • raft fortification support for a term
  • raft fortification support for a (term, epoch) pair
  • raft fortification support for a (term, epoch) pair that has an store liveness expiration that is current
  • raft fortification support for from a quorum of peers whose quorum supported expiration is current ("lead support")

One suggestion would be to stop referring to "fortification" as "support". A follower "fortifies" a leader, it doesn't provide "fortifying support" to the leader. We could rename "SupportTracker" to "FortificationTracker", rename "RecordSupport" to "RecordFortification", and rename "IsSupportedBy" to "IsFortifiedBy".

This keeps the entire fortification protocol (which operates on terms and epochs, but never timestamps) clear of the term "support". That means that whenever we are referring to "support", we are talking about "support in real time".

We would keep "LeadSupportUntil", as "lead support" is derived from a join between store liveness support and leader fortification.

What do you think @arulajmani and @iskettaneh?


pkg/raft/raft.go line 783 at r8 (raw file):

}

// maybeBcastFortify attempts to send an RPC to fortify the leader to all the

IMO this doesn't need a maybe prefix, because each condition ("maybe") is evaluated independently on each peer. There's not a single, top-level condition. That matches the relationship between bcastAppend and maybeSendAppend.


pkg/raft/tracker/supporttracker.go line 69 at r8 (raw file):

	supportEpoch, exist := st.support[id]
	if !exist {
		// we don't know that the follower supports us in the epoch.

"We"

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 713 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is another case where we are assuming that ok means that support is currently provided. @miraradeva is switching that in #131113, so we could just wait for that instead of making a change.

Yes, I reverted this logic back to this and will basically wait for Mira's change to land first.


pkg/raft/raft.go line 729 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I've found myself getting tripped up by the different meanings of "support" enough times that I think we should probably try to clean this up. We use "support" in all of the following ways:

  • store liveness support for an epoch
  • store liveness support for an epoch that is current
  • raft fortification support for a term
  • raft fortification support for a (term, epoch) pair
  • raft fortification support for a (term, epoch) pair that has an store liveness expiration that is current
  • raft fortification support for from a quorum of peers whose quorum supported expiration is current ("lead support")

One suggestion would be to stop referring to "fortification" as "support". A follower "fortifies" a leader, it doesn't provide "fortifying support" to the leader. We could rename "SupportTracker" to "FortificationTracker", rename "RecordSupport" to "RecordFortification", and rename "IsSupportedBy" to "IsFortifiedBy".

This keeps the entire fortification protocol (which operates on terms and epochs, but never timestamps) clear of the term "support". That means that whenever we are referring to "support", we are talking about "support in real time".

We would keep "LeadSupportUntil", as "lead support" is derived from a join between store liveness support and leader fortification.

What do you think @arulajmani and @iskettaneh?

That makes sense to me. What do you think @arulajmani


pkg/raft/raft.go line 783 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

IMO this doesn't need a maybe prefix, because each condition ("maybe") is evaluated independently on each peer. There's not a single, top-level condition. That matches the relationship between bcastAppend and maybeSendAppend.

Makes sense. Done

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r8, 9 of 9 files at r10, 8 of 8 files at r11, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 1009 at r7 (raw file):

Previously, iskettaneh wrote…

That's the end goal yes. However, we can't. reach that goal just yet until we remove all heartbeat dependencies. This should happen very soon though.

Ack. So is the plan to wait until a second use of SupportFromEnabled is introduced to introduce (*raft).fortificationEnabled, right?


pkg/raft/raft.go line 729 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I've found myself getting tripped up by the different meanings of "support" enough times that I think we should probably try to clean this up. We use "support" in all of the following ways:

  • store liveness support for an epoch
  • store liveness support for an epoch that is current
  • raft fortification support for a term
  • raft fortification support for a (term, epoch) pair
  • raft fortification support for a (term, epoch) pair that has an store liveness expiration that is current
  • raft fortification support for from a quorum of peers whose quorum supported expiration is current ("lead support")

One suggestion would be to stop referring to "fortification" as "support". A follower "fortifies" a leader, it doesn't provide "fortifying support" to the leader. We could rename "SupportTracker" to "FortificationTracker", rename "RecordSupport" to "RecordFortification", and rename "IsSupportedBy" to "IsFortifiedBy".

This keeps the entire fortification protocol (which operates on terms and epochs, but never timestamps) clear of the term "support". That means that whenever we are referring to "support", we are talking about "support in real time".

We would keep "LeadSupportUntil", as "lead support" is derived from a join between store liveness support and leader fortification.

What do you think @arulajmani and @iskettaneh?

This sounds good to me. I'm not sure if others have, but I've definitely been saying "fortification support" a lot to describe what we're doing here -- partially to blame for the confusion here.

@iskettaneh if you're good with this too, maybe we can prefix this PR with a commit to do this rename (or pull it out into a separate PR which we can land independently). There's a few places in the code/test files where I'm saying "fortification support" -- might be good to reword those as well.


pkg/raft/rafttest/interaction_env_handler_add_nodes.go line 189 at r2 (raw file):

Previously, iskettaneh wrote…

I managed to remove the explicit store nodes argument

Nice!


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

Previously, iskettaneh wrote…

Done the renaming.

I wanted to avoid having the supporTracker call another SupportFrom(). But I can change that if you think its important.

It's not great that the SupportTracker has access to StoreLiveness, yet it's being passed an epoch by the caller. If you don't want to duplicate the call to SupportFrom (which I agree, isn't great), maybe we can have the SupportTracker be the only one to call SupportFrom. In that case, we'd have it return a boolean to the caller indicating whether the follower's store is currently supported in StoreLiveness or not, based on which we can decide whether to send a re-fortification message or not.


pkg/raft/testdata/fortification_basic.txt line 124 at r6 (raw file):

Previously, iskettaneh wrote…

Good point. I added a fourth node to exercise all basic code paths: (1) The follower accepts fortification, (2) The follower rejects fortification, (3) The leader doesn't attempt fortification. Let me know if that complicated the test too much in your opinion.

Your approach was neat, much better than my suggestion! Nice 💯


pkg/raft/tracker/supporttracker_test.go line 202 at r8 (raw file):

				// StoreLiveness.
				//
				// NB: This is possible if there is a race between store liveness

Should we also add this commentary in IsSupportedBy to mirror LeadSupportUntil?

@iskettaneh iskettaneh requested a review from a team as a code owner September 24, 2024 00:10
@iskettaneh iskettaneh requested review from kev-cao and removed request for a team September 24, 2024 00:10
Copy link

blathers-crl bot commented Sep 24, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 1009 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Ack. So is the plan to wait until a second use of SupportFromEnabled is introduced to introduce (*raft).fortificationEnabled, right?

yes


pkg/raft/testdata/fortification_basic.txt line 124 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Your approach was neat, much better than my suggestion! Nice 💯

Done.


pkg/raft/tracker/supporttracker.go line 66 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

It's not great that the SupportTracker has access to StoreLiveness, yet it's being passed an epoch by the caller. If you don't want to duplicate the call to SupportFrom (which I agree, isn't great), maybe we can have the SupportTracker be the only one to call SupportFrom. In that case, we'd have it return a boolean to the caller indicating whether the follower's store is currently supported in StoreLiveness or not, based on which we can decide whether to send a re-fortification message or not.

Done.


pkg/raft/tracker/supporttracker_test.go line 202 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we also add this commentary in IsSupportedBy to mirror LeadSupportUntil?

Done.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be good to go after this round of comments is addressed!

Reviewed 16 of 16 files at r12, 7 of 7 files at r13, 8 of 8 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 1279 at r12 (raw file):

			if IsMsgFromLeader(m.Type) {
				// We've just received a message from the new leader which was elected
				// at a higher term. The old leader's fortification has expired, so it's

nit: "the old leader is no longer fortified"


pkg/raft/raft.go line 714 at r13 (raw file):

	}

	isSupported, isFortified := r.fortificationTracker.IsFortifiedby(id)

nit: we could just check isFortified and early return if the follower is already fortified.


pkg/raft/rawnode.go line 552 at r12 (raw file):

}

func (rn *RawNode) TestingSupportStateString() string {

Should we rename this to TestingFortificationStateString as well?

Same comment for handlePrintSupportState and the datadriven directive which says print-support-state -- we should switch all of those to use the fortification verbiage as well.

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/rawnode.go line 552 at r12 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we rename this to TestingFortificationStateString as well?

Same comment for handlePrintSupportState and the datadriven directive which says print-support-state -- we should switch all of those to use the fortification verbiage as well.

Done.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 16 files at r15, 3 of 7 files at r16, 2 of 8 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @iskettaneh, @kev-cao, and @miraradeva)


pkg/raft/raft.go line 716 at r16 (raw file):

	isSupported, isFortified := r.fortificationTracker.IsFortifiedBy(id)

	// Exist early if the follower is already fortified.

"Exit"

But also, do we need this early return? It doesn't seem to save much work, and it makes the control flow less intuitive.


pkg/raft/tracker/fortificationtracker.go line 66 at r16 (raw file):

// IsFortifiedBy returns whether the follower fortifies the leader or not.
// If the follower's store doesn't support the leader's store in the store
// liveness fabric, then both isSupported and isFortified will be false.

isFortified implies isSupported, right? If so, could we mention that?


pkg/raft/tracker/fortificationtracker.go line 68 at r16 (raw file):

// liveness fabric, then both isSupported and isFortified will be false.
func (st *FortificationTracker) IsFortifiedBy(id pb.PeerID) (isSupported bool, isFortified bool) {
	supportEpoch, _, ok := st.storeLiveness.SupportFrom(id)

We'll need to update this when we rebase on #131113.


pkg/raft/tracker/fortificationtracker.go line 70 at r16 (raw file):

	supportEpoch, _, ok := st.storeLiveness.SupportFrom(id)
	if !ok {
		return isSupported, isFortified

Returning the named variables is generally an anti-pattern in cases where it can be avoided. It makes it more difficult to determine what is being returned. Prefer returning literals where possible. So return false, false here, and consider following that pattern below.


pkg/raft/tracker/fortificationtracker.go line 83 at r16 (raw file):

	}

	// NB: We can't assert that supportEpoch <= curEpoch because there may be a

s/curEpoch/fortificationEpoch/

Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 716 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"Exit"

But also, do we need this early return? It doesn't seem to save much work, and it makes the control flow less intuitive.

@arulajmani Do you have a strong preference to keep the early exit?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 7 files at r16, 16 of 16 files at r18, 13 of 13 files at r19, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @kev-cao, and @miraradeva)


pkg/raft/raft.go line 716 at r16 (raw file):

Previously, iskettaneh wrote…

@arulajmani Do you have a strong preference to keep the early exit?

I don't have a strong preference, so no need to block on this comment.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 7 files at r16, 16 of 16 files at r18, 13 of 13 files at r19, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 729 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This sounds good to me. I'm not sure if others have, but I've definitely been saying "fortification support" a lot to describe what we're doing here -- partially to blame for the confusion here.

@iskettaneh if you're good with this too, maybe we can prefix this PR with a commit to do this rename (or pull it out into a separate PR which we can land independently). There's a few places in the code/test files where I'm saying "fortification support" -- might be good to reword those as well.

Resolving given we landed this already.


pkg/raft/raft.go line 716 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't have a strong preference, so no need to block on this comment.

I suggested this to buy some indentation below.

nit: you could "return // early return if the follower's fortified" instead of the comment above the if block, if you like.


pkg/raft/raft.go line 765 at r18 (raw file):

	}

	if !isFortified {

We no longer need this now that we have the early return, right?


pkg/raft/tracker/fortificationtracker.go line 68 at r18 (raw file):

// liveness fabric, then both isSupported and isFortified will be false.
// If isFortified is true, it implies that isSupported is also true.
func (st *FortificationTracker) IsFortifiedBy(id pb.PeerID) (isSupported bool, isFortified bool) {

Given the function is called IsFortifiedBy, I'd have expected the first boolean returned to be isFortified. The order of these return values threw me off a bit -- how do you feel about switching them around?


pkg/raft/tracker/fortificationtracker.go line 87 at r18 (raw file):

	// is supporting the leader's store at the epoch in the MsgFortifyLeaderResp
	// message.
	if fortificationEpoch == supportEpoch {

nit: can we simplify this to return true, fortificationEpoch == supportEpoch (flipped, if you take the suggestion above).

This commit makes the leader skip trying to fortify followers which are
not supported in store liveness.
Copy link
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 765 at r18 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

We no longer need this now that we have the early return, right?

Done.


pkg/raft/tracker/fortificationtracker.go line 68 at r18 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Given the function is called IsFortifiedBy, I'd have expected the first boolean returned to be isFortified. The order of these return values threw me off a bit -- how do you feel about switching them around?

Done.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for working through the multiple rounds of reviews on this one. Let's get it shipped 🚀

Reviewed 10 of 10 files at r20, 8 of 8 files at r21, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @iskettaneh, @kev-cao, @miraradeva, and @nvanbenschoten)


pkg/raft/raft.go line 746 at r20 (raw file):

	if isFortified {
		return // return early if the follower's fortified.

nit: typically, with these inline comments, we omit the period (and don't start the sentence with a capital letter either, like you've done). So this would be return // early return if the follower's fortified

This commit makes the leader periodically attempt to fortify unfortified
followers.

Before this change, the leader used to send a fortification message to
followers, but doesn't try to refortify unfortified followers.
@iskettaneh
Copy link
Contributor Author

bors r+

@craig craig bot merged commit 727fc55 into cockroachdb:master Sep 27, 2024
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

raft: add abbility to re-fortify
4 participants