Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storeliveness: add transport metrics #131025

Merged
merged 2 commits into from
Sep 25, 2024

Conversation

miraradeva
Copy link
Contributor

@miraradeva miraradeva commented Sep 19, 2024

This commit adds seven new metrics for the Store Liveness Transport:

  • storeliveness.transport.send-queue-size: Number of pending outgoing messages in all Store Liveness Transport per-store send queues.
  • storeliveness.transport.send-queue-bytes: Total byte size of pending outgoing messages in all Store Liveness Transport per-store send queues.
  • storeliveness.transport.send-queue-idle: Number of Store Liveness Transport per-store send queues that have become idle due to no recently-sent messages.
  • storeliveness.transport.sent: Number of Store Liveness messages sent by the Store Liveness Transport.
  • storeliveness.transport.received: Number of Store Liveness messages received by the Store Liveness Transport.
  • storeliveness.transport.send_dropped: Number of Store Liveness messages dropped by the Store Liveness Transport on the sender side.
  • storeliveness.transport.receive_dropped: Number of Store Liveness messages dropped by the Store Liveness Transport on the receiver side.

Part of: #125067

Release note: None


Screenshot 2024-09-20 at 1 20 18 PM

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@miraradeva miraradeva force-pushed the mira-125067-transport branch 4 times, most recently from ea6644c to 5e3ed64 Compare September 20, 2024 16:48
@miraradeva miraradeva marked this pull request as ready for review September 20, 2024 17:36
@miraradeva miraradeva requested a review from a team as a code owner September 20, 2024 17:36
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 8 of 8 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @miraradeva)


-- commits line 31 at r2:
"blocked"


pkg/kv/kvserver/store.go line 2194 at r2 (raw file):

	options := storeliveness.NewOptions(heartbeatInterval, livenessInterval, supportGracePeriod)
	sm := storeliveness.NewSupportManager(
		slpb.StoreIdent{NodeID: s.nodeDesc.NodeID, StoreID: s.StoreID()}, s.StateEngine(), options,

Is it surprising that this didn't cause any tests to fail? Was store liveness completely broken without this fix?


pkg/kv/kvserver/storeliveness/metrics.go line 74 at r1 (raw file):

	}
	metaMessagesDropped = metric.Metadata{
		Name: "storeliveness.transport.dropped",

Should we have separate metrics for send_dropped and received_dropped? We don't yet drop on receive, but once we do, it will be helpful to distinguish the two cases. EDIT: we do not drop on receive, as of commit 2. All the more reason to make this change here.


pkg/kv/kvserver/storeliveness/support_manager.go line 372 at r2 (raw file):

	// Drop messages if maxReceiveQueueSize is reached.
	if len(q.mu.msgs) >= maxReceiveQueueSize {
		return errors.Errorf("store liveness receive queue is full")

We should pre-allocate this error as a static variable to avoid the cost on each dropped message. var receiveQueueSizeLimitReachedErr = ...


pkg/kv/kvserver/storeliveness/transport.go line 190 at r2 (raw file):

		t.metrics.MessagesDropped.Inc(1)
	}
	t.metrics.MessagesReceived.Inc(1)

Do we want to increment "received" if "dropped" is incremented?

Copy link
Contributor Author

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 31 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"blocked"

Done


pkg/kv/kvserver/store.go line 2194 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it surprising that this didn't cause any tests to fail? Was store liveness completely broken without this fix?

No tests failed because I haven't added any tests with a real store. It should be caught by the end-to-end tests that I'll add soon. I spotted it in the logs while running a local cluster. Store liveness was running, and all stores were 0; I think it would have definitely caused issues if there were multiple stores per node.


pkg/kv/kvserver/storeliveness/metrics.go line 74 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we have separate metrics for send_dropped and received_dropped? We don't yet drop on receive, but once we do, it will be helpful to distinguish the two cases. EDIT: we do not drop on receive, as of commit 2. All the more reason to make this change here.

Yes, good idea. I can also add metrics for the receiver queue, similar to the sender queue. Or leave it as a followup.


pkg/kv/kvserver/storeliveness/support_manager.go line 372 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should pre-allocate this error as a static variable to avoid the cost on each dropped message. var receiveQueueSizeLimitReachedErr = ...

Done


pkg/kv/kvserver/storeliveness/transport.go line 190 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we want to increment "received" if "dropped" is incremented?

Nope, good catch. I added a test too.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 7 of 7 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

This commit adds seven new metrics for the Store Liveness Transport:

- `storeliveness.transport.send-queue-size`: Number of pending outgoing
  messages in all Store Liveness Transport per-store send queues.
- `storeliveness.transport.send-queue-bytes`: Total byte size of pending
  outgoing messages in all Store Liveness Transport per-store send
  queues.
- `storeliveness.transport.send-queue-idle`: Number of Store Liveness
  Transport per-store send queues that have become idle due to no
  recently-sent messages.
- `storeliveness.transport.sent`: Number of Store Liveness messages sent
  by the Store Liveness Transport.
- `storeliveness.transport.received`: Number of Store Liveness messages
  received by the Store Liveness Transport.
- `storeliveness.transport.send_dropped`: Number of Store Liveness messages
  dropped by the Store Liveness Transport on the sender side.
- `storeliveness.transport.receive_dropped`: Number of Store Liveness messages
  dropped by the Store Liveness Transport on the receiver side.

Part of: cockroachdb#125067

Release note: None
Previously, the SupportManager's receive queue could grown unbounded if
incoming messages were not processed fast enough. This can happen in
the case of a disk stall when the SupportManager's processing goroutine
is blocked on a write and not handling any new incoming messages.

This commit adds a maximum size to the receive queue. After the maximum
size is reached, handling further incoming messages returns an error,
which is interpreted as a dropped message in the transport.

Part of: cockroachdb#125063

Release note: None
@miraradeva
Copy link
Contributor Author

The most recent push deflakes the TestTransportRestartedNode test (some of the expected metrics were not always there in the presence of various expected races), and fixes the formatting of the TestTransportFullReceiveQueue.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: when CI is green.

Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@miraradeva
Copy link
Contributor Author

TFTR!

bors r=nvanbenschoten

@craig craig bot merged commit 611d44d into cockroachdb:master Sep 25, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants