-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storeliveness: add transport metrics #131025
Conversation
ea6644c
to
5e3ed64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 8 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @miraradeva)
-- commits
line 31 at r2:
"blocked"
pkg/kv/kvserver/store.go
line 2194 at r2 (raw file):
options := storeliveness.NewOptions(heartbeatInterval, livenessInterval, supportGracePeriod) sm := storeliveness.NewSupportManager( slpb.StoreIdent{NodeID: s.nodeDesc.NodeID, StoreID: s.StoreID()}, s.StateEngine(), options,
Is it surprising that this didn't cause any tests to fail? Was store liveness completely broken without this fix?
pkg/kv/kvserver/storeliveness/metrics.go
line 74 at r1 (raw file):
} metaMessagesDropped = metric.Metadata{ Name: "storeliveness.transport.dropped",
Should we have separate metrics for send_dropped
and received_dropped
? We don't yet drop on receive, but once we do, it will be helpful to distinguish the two cases. EDIT: we do not drop on receive, as of commit 2. All the more reason to make this change here.
pkg/kv/kvserver/storeliveness/support_manager.go
line 372 at r2 (raw file):
// Drop messages if maxReceiveQueueSize is reached. if len(q.mu.msgs) >= maxReceiveQueueSize { return errors.Errorf("store liveness receive queue is full")
We should pre-allocate this error as a static variable to avoid the cost on each dropped message. var receiveQueueSizeLimitReachedErr = ...
pkg/kv/kvserver/storeliveness/transport.go
line 190 at r2 (raw file):
t.metrics.MessagesDropped.Inc(1) } t.metrics.MessagesReceived.Inc(1)
Do we want to increment "received" if "dropped" is incremented?
5e3ed64
to
d7ea27d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"blocked"
Done
pkg/kv/kvserver/store.go
line 2194 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it surprising that this didn't cause any tests to fail? Was store liveness completely broken without this fix?
No tests failed because I haven't added any tests with a real store. It should be caught by the end-to-end tests that I'll add soon. I spotted it in the logs while running a local cluster. Store liveness was running, and all stores were 0; I think it would have definitely caused issues if there were multiple stores per node.
pkg/kv/kvserver/storeliveness/metrics.go
line 74 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we have separate metrics for
send_dropped
andreceived_dropped
? We don't yet drop on receive, but once we do, it will be helpful to distinguish the two cases. EDIT: we do not drop on receive, as of commit 2. All the more reason to make this change here.
Yes, good idea. I can also add metrics for the receiver queue, similar to the sender queue. Or leave it as a followup.
pkg/kv/kvserver/storeliveness/support_manager.go
line 372 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should pre-allocate this error as a static variable to avoid the cost on each dropped message.
var receiveQueueSizeLimitReachedErr = ...
Done
pkg/kv/kvserver/storeliveness/transport.go
line 190 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want to increment "received" if "dropped" is incremented?
Nope, good catch. I added a test too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)
This commit adds seven new metrics for the Store Liveness Transport: - `storeliveness.transport.send-queue-size`: Number of pending outgoing messages in all Store Liveness Transport per-store send queues. - `storeliveness.transport.send-queue-bytes`: Total byte size of pending outgoing messages in all Store Liveness Transport per-store send queues. - `storeliveness.transport.send-queue-idle`: Number of Store Liveness Transport per-store send queues that have become idle due to no recently-sent messages. - `storeliveness.transport.sent`: Number of Store Liveness messages sent by the Store Liveness Transport. - `storeliveness.transport.received`: Number of Store Liveness messages received by the Store Liveness Transport. - `storeliveness.transport.send_dropped`: Number of Store Liveness messages dropped by the Store Liveness Transport on the sender side. - `storeliveness.transport.receive_dropped`: Number of Store Liveness messages dropped by the Store Liveness Transport on the receiver side. Part of: cockroachdb#125067 Release note: None
Previously, the SupportManager's receive queue could grown unbounded if incoming messages were not processed fast enough. This can happen in the case of a disk stall when the SupportManager's processing goroutine is blocked on a write and not handling any new incoming messages. This commit adds a maximum size to the receive queue. After the maximum size is reached, handling further incoming messages returns an error, which is interpreted as a dropped message in the transport. Part of: cockroachdb#125063 Release note: None
d7ea27d
to
69ac6a4
Compare
The most recent push deflakes the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)
TFTR! bors r=nvanbenschoten |
This commit adds seven new metrics for the Store Liveness Transport:
storeliveness.transport.send-queue-size
: Number of pending outgoing messages in all Store Liveness Transport per-store send queues.storeliveness.transport.send-queue-bytes
: Total byte size of pending outgoing messages in all Store Liveness Transport per-store send queues.storeliveness.transport.send-queue-idle
: Number of Store Liveness Transport per-store send queues that have become idle due to no recently-sent messages.storeliveness.transport.sent
: Number of Store Liveness messages sent by the Store Liveness Transport.storeliveness.transport.received
: Number of Store Liveness messages received by the Store Liveness Transport.storeliveness.transport.send_dropped
: Number of Store Liveness messages dropped by the Store Liveness Transport on the sender side.storeliveness.transport.receive_dropped
: Number of Store Liveness messages dropped by the Store Liveness Transport on the receiver side.Part of: #125067
Release note: None