-
Notifications
You must be signed in to change notification settings - Fork 575
Add per-client queue length metrics #2754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
It's a metric that tracks the size of the incoming per-client `message_queue`. We've been concerned about not having visibility into this queue's length, as currently we only have the length of the per-database reducer queue, but each client can only have a single reducer in that queue, and may have additional messages waiting in its per-client queue. The new metric, `spacetime_client_connection_incoming_queue_length`, is an `IntGaugeVec` with the labels: `db: Identity, client_identity: Identity, connection_id: ConnectionId`. My theory is that in our viewer we can inspect the average and the sum per database, and it also may be interesting to be able to look at individual clients, as e.g. the BitCraft mob monitor may be a notable outlier. This is not the same pattern as most of our other metrics, though, which tend to only offer per-database granularity. It's possible that this new metric should also have the last two labels removed, and be labeled only on `db: Identity`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are indeed the labels that we want for right now.
scopeguard::defer!( | ||
if let Err(e) = WORKER_METRICS | ||
.client_connection_incoming_queue_length | ||
.remove_label_values(&addr, &client_identity, &connection_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In lieu of a test, I'd just add a comment here about resetting the gauge when the connection ends since that's something that has plagued us before.
|
||
#[name = spacetime_client_connection_incoming_queue_length] | ||
#[help = "The number of client -> server WebSocket messages waiting in a client connection's incoming queue"] | ||
#[labels(db: Identity, client_identity: Identity, connection_id: ConnectionId)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid adding any metrics with per-client or per-connection labels. I'm not sure how big the impact would be on our metrics collection/storage/querying performance, but the cardinality of a stat like this can blow up very quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed out-of-band, I'll alter this to be a per-database metric, which I suppose will be the sum of the length of all clients' queues. I expect to open a separate PR for that, since we want to use this PR in testing.
Like the previous metric added in this branch, it's per-client, so we'll use it for testing, but likely not merge it into master. I'll follow up in a separate PR with a version that's per-database instead.
client_connection_incoming_queue_length
metricThis commit alters the message queue length metrics introduced by #2754 to be per-database, rather than per-client. This should limit the cardinality of these metrics, and better lines up with the labels of our other metrics. Because the metrics are now per-database rather than per-client, it's no longer correct to just drop the label when the client disconnects. Instead, care must be taken to decrement the metric by the number of messages which were waiting in the queue at the time of the disconnection. I've added comments to call attention to this complexity.
Included resolving merge conflicts in client_connection.rs, moving the sendtx queue length metric into the `ClientConnectionMetrics`.
Description of Changes
This PR adds two new metrics,
spacetime_client_connection_incoming_queue_length
andspacetime_client_connection_outgoing_queue_length
, which track the lengths of the per-client incoming and outgoing message queues. We expect to use these metrics in testing, incl. bot testing, but not to merge them, as we expect the cardinality of per-client metrics to be too high. I will open a separate PR with per-database versions of these metrics which we can merge.